Skip to content

Conversation

@maropu
Copy link
Member

@maropu maropu commented Apr 15, 2020

What changes were proposed in this pull request?

This PR intends to add a Python script to generates a SQL document for built-in functions and the document in SQL references.

Why are the changes needed?

To make SQL references complete.

Does this PR introduce any user-facing change?

Yes;

a
b
c

How was this patch tested?

Manually checked and added tests.

@maropu
Copy link
Member Author

maropu commented Apr 15, 2020

just like this? @HyukjinKwon @huaxingao

@SparkQA
Copy link

SparkQA commented Apr 15, 2020

Test build #121314 has finished for PR 28224 at commit fc9f2d5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 15, 2020

Test build #121318 has finished for PR 28224 at commit ef6391e.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Spark SQL has some categories of frequently-used built-in functions for aggregtion, arrays/maps, date/timestamp, and JSON data.
This subsection presents the usages and descriptions of these functions.

* [Aggregate Functions](sql-ref-functions-builtin.html#aggregate-functions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super nit: order of bullets/words in the above statement is different from that in sql-ref-functions-builtin.html. .e. the alphabetical order in sql-ref-functions-builtin.html.

I am neutral on the order (current or consistent among them).

Copy link
Member Author

@maropu maropu Apr 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, actually, I sorted them by similar functionality, e.g., array and map. On the other hand, in sql-ref-functions-builtin.html, the script sorts them just by group name.... Yea, we can update the script for sorting them in the same order though, I just want to keep it simple.

@huaxingao
Copy link
Contributor

Looks really nice! I love this feature! Thanks a lot! @maropu @HyukjinKwon

@huaxingao
Copy link
Contributor

A few comments:

  1. Can we still have window functions here? In SQL Syntax section, I am focusing on SQL syntax, I don't have a list of window functions and their descriptions there.
  2. Can we list other function categories as well? such as Math functions, string functions, sorting functions, udf functions? Also, functions.scala puts Array functions in Collection functions group. Shall we use Collection functions instead of Array functions?
  3. In the left side menu, can we have Built-in Functions and User Defined Functions, and put the three UDFs as subcategories?

@maropu
Copy link
Member Author

maropu commented Apr 16, 2020

Can we still have window functions here? In SQL Syntax section, I am focusing on SQL syntax, I don't have a list of window functions and their descriptions there.

Yea, we can do. I'll update later. Thanks!

Can we list other function categories as well? such as Math functions, string functions, sorting functions, udf functions?

Yea, we can do, too. But, in this PR, I just want to focus on the python code and the basic set of built-in functions. As for the other function groups, I think we need to discuss more about which functions should be listed, or not.

Also, functions.scala puts Array functions in Collection functions group. Shall we use Collection functions instead of Array functions?

Yea, I noticed that, but the collection group includes array, map, and JSON functions. I personally think splitting it into the three groups looks easy-to-serach for users. WDYT?

In the left side menu, can we have Built-in Functions and User Defined Functions, and put the three UDFs as subcategories?

The original one had multiple document pages for the functions, so it had one more hierarchy. But, the script generates a single page for them. So, I think we don't need the hierarchy. If we need to improve the structure, feel free to do it as follow-up.

@maropu
Copy link
Member Author

maropu commented Apr 16, 2020

Can we still have window functions here? In SQL Syntax section, I am focusing on SQL syntax, I don't have a list of window functions and their descriptions there.

Yea, we can do. I'll update later. Thanks!

Ur, I noticed now that all the window functions have no example tag, e.g., https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L428

We need to add examples there before generating the doc.

@huaxingao
Copy link
Contributor

We need to add examples there before generating the doc.

I guess no window functions examples there because these examples are a little complicated. Is it OK we just generate the doc without examples?

@SparkQA
Copy link

SparkQA commented Apr 16, 2020

Test build #121341 has finished for PR 28224 at commit 3382bf5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Apr 16, 2020

Looks fine now. But, adding a simpler example looks good if possible. Updated in b9407ca

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Apr 16, 2020

Nice, thanks you so much @maropu. I have some ideas on my mind for example, not including the generated MD files into the Github repo. We can of course include them but it might be better to do it separately because we're already not including md files in SQL built-in functions.

Do you mind if I manually push some changes into your branch (and to your PR)? Of course, feel free to edit as you want after that. My changes might be mostly just my preference or nits I guess.

@maropu
Copy link
Member Author

maropu commented Apr 16, 2020

Nice, thanks you so much @maropu. I have some ideas on my mind for example, not including the generated MD files into the Github repo. We can of course include them but it might be better to do it separately because we're already not including md files in SQL built-in functions.

Yea, the idea look reasonable.

Do you mind if I manually push some changes into your branch (and to your PR)? Of course, feel free to edit as you want after that. My changes might be mostly just my preference or nits I guess.

Of course not! Both is ok to me.

@SparkQA
Copy link

SparkQA commented Apr 16, 2020

Test build #121346 has finished for PR 28224 at commit b9407ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 16, 2020

Test build #121350 has finished for PR 28224 at commit d66a9aa.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Apr 16, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Apr 16, 2020

Test build #121355 has finished for PR 28224 at commit d66a9aa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 17, 2020

Test build #121416 has finished for PR 28224 at commit 7e9ebdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 20, 2020

Test build #121515 has finished for PR 28224 at commit f58335d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Apr 20, 2020

Looks pretty nice, thanks for the update, @HyukjinKwon ! LGTM except for two comment.

@nchammas
Copy link
Contributor

Silly question: What's the relationship of the SQL function docs being added in this PR to the SQL function docs we already have here?

@maropu
Copy link
Member Author

maropu commented Apr 20, 2020

Have you checked the discussion in #28170 (comment) ? The document in api/sql just includes a full flatten list of the functions. So, for popular built-in functions (e.g., aggregate, array, and date/time functions), we're planning to provide the categorized document page only for them.

@maropu
Copy link
Member Author

maropu commented Apr 21, 2020

Updated for my two comments in f302c91. Could you check again for final sign-off? @HyukjinKwon

@HyukjinKwon
Copy link
Member

Silly question: What's the relationship of the SQL function docs being added in this PR to the SQL function docs we already have here?

Actually .. I target to remove that SQL function docs in long run. For the short term, #28224 (comment) is true - we don't list all expressions yet.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@HyukjinKwon
Copy link
Member

I am going to merge this because the documentation generation build is passed in Github Actions. There's nothing to check with PR builder.

@HyukjinKwon
Copy link
Member

Merged to master and branch-3.0.

HyukjinKwon added a commit that referenced this pull request Apr 21, 2020
…ilt-in functions

This PR intends to add a Python script to generates a SQL document for built-in functions and the document in SQL references.

To make SQL references complete.

Yes;

![a](https://user-images.githubusercontent.com/692303/79406712-c39e1b80-7fd2-11ea-8b85-9f9cbb6efed3.png)
![b](https://user-images.githubusercontent.com/692303/79320526-eb46a280-7f44-11ea-8639-90b1fb2b8848.png)
![c](https://user-images.githubusercontent.com/692303/79320707-3365c500-7f45-11ea-9984-69ffe800fb87.png)

Manually checked and added tests.

Closes #28224 from maropu/SPARK-31429.

Lead-authored-by: Takeshi Yamamuro <[email protected]>
Co-authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit e42dbe7)
Signed-off-by: HyukjinKwon <[email protected]>
@HyukjinKwon
Copy link
Member

@huaxingao do you want to separate pages, and address the leftover comments at #28224 (comment)? Should be pretty straightforward to do it now.

@SparkQA
Copy link

SparkQA commented Apr 21, 2020

Test build #121550 has finished for PR 28224 at commit f302c91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

### Aggregate Functions
{% include_relative generated-agg-funcs-table.html %}
#### Examples
{% include_relative generated-agg-funcs-examples.html %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we put all the examples in one section?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, we currently put examples per group in one section like this;

### Aggrgate Functions
<A list of aggregate functions>
#### Examples
<All examples of the aggregate functions listed above>

### Window Functions
<A list of aggregate functions>
#### Examples
<All examples of the window functions listed above>
...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to put everything about one function (signature, desc, examples, etc.) together. Did we decide to do this due to technical difficulties?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do it. I just a bit rushed to merge this to keep this PR narrow-scoped as a base work.
Nothing is completely pinned at this moment.

As an easy way, we can refer other DBMSs documentations and mimic I guess.

Copy link
Member Author

@maropu maropu Apr 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so and we can improve the structure by follow-up activities. But, I'm not sure the improvement should be applied at the 3.0 release.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea we can improve it in 3.1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks, I'll file jira later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@huaxingao
Copy link
Contributor

I just saw this

Screen Shot 2020-04-21 at 6 55 37 PM

Do we need this scalar function sections in getting started page? We either need to delete this section or write a short description and link it to the built-in functions page.

@maropu
Copy link
Member Author

maropu commented Apr 22, 2020

How about Scala Functions -> Built-in Functions, then adding a short description and a link to sql-ref-functions.html#built-in-functions?

@huaxingao
Copy link
Contributor

Sounds good. I will fix this.

@HyukjinKwon
Copy link
Member

Thanks @huaxingao!

HyukjinKwon pushed a commit that referenced this pull request Apr 24, 2020
### What changes were proposed in this pull request?

This PR intends to add a new test suite for `ExpressionInfo`. Major changes are as follows;

 - Added a new test suite named `ExpressionInfoSuite`
 - To improve test coverage, added a test for error handling in `ExpressionInfoSuite`
 - Moved the `ExpressionInfo`-related tests from `UDFSuite` to `ExpressionInfoSuite`
 - Moved the related tests from `SQLQuerySuite` to `ExpressionInfoSuite`
 - Added a comment in `ExpressionInfoSuite` (followup of #28224)

### Why are the changes needed?

To improve test suites/coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #28308 from maropu/SPARK-31526.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Apr 24, 2020
### What changes were proposed in this pull request?

This PR intends to add a new test suite for `ExpressionInfo`. Major changes are as follows;

 - Added a new test suite named `ExpressionInfoSuite`
 - To improve test coverage, added a test for error handling in `ExpressionInfoSuite`
 - Moved the `ExpressionInfo`-related tests from `UDFSuite` to `ExpressionInfoSuite`
 - Moved the related tests from `SQLQuerySuite` to `ExpressionInfoSuite`
 - Added a comment in `ExpressionInfoSuite` (followup of #28224)

### Why are the changes needed?

To improve test suites/coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #28308 from maropu/SPARK-31526.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 42f496f)
Signed-off-by: HyukjinKwon <[email protected]>
@gatorsmile
Copy link
Member

Thank you, this looks great!

### JSON Functions
{% include_relative generated-json-funcs-table.html %}
#### Examples
{% include_relative generated-agg-funcs-examples.html %}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generated-agg-funcs-examples.html -> generated-json-funcs-examples.html ?

Copy link
Member Author

@maropu maropu May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea... @huaxingao @dilipbiswal Could you include this fix in your open PRs? #28451
or #28433

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I will fix this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @huaxingao

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, it was my mistake I guess. Thanks guys,

@gatorsmile
Copy link
Member

So far, it just covers a subset. Could we add more categories to cover more built-in functions listed in https://spark.apache.org/docs/latest/api/sql/index.html? For your reference, Presto has the complete list https://prestodb.io/docs/current/functions.html cc @huaxingao @maropu

@maropu
Copy link
Member Author

maropu commented Oct 12, 2020

Filed in https://issues.apache.org/jira/browse/SPARK-33124
Do you have time to work on this? @tanelk (If nobody takes this, I'll do it later.)

@tanelk
Copy link
Contributor

tanelk commented Oct 13, 2020

Yeah, I can take a look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants