@huaxingao
Contributor

@huaxingao huaxingao commented Apr 22, 2020

What changes were proposed in this pull request?

Add a paragraph for scalar function in sql getting started

Why are the changes needed?

To make 3.0 doc complete.

Does this PR introduce any user-facing change?

before:
[screenshot: Screen Shot 2020-04-21 at 10 11 12 PM]

after:
[screenshot: Screen Shot 2020-04-22 at 11 49 59 PM]

[screenshot: Screen Shot 2020-04-23 at 6 22 53 PM]

How was this patch tested?

@huaxingao
Contributor Author

@HyukjinKwon @maropu
I quickly wrote a paragraph for scalar functions. It has to be about scalar functions because the following paragraph covers aggregation. The definition of a scalar function is a bit confusing: Snowflake (https://docs.snowflake.com/en/sql-reference/functions.html) and Oracle (https://docs.oracle.com/cd/E57185_01/IRWUG/ch12s04s01.html) define scalar functions as the opposite of aggregate functions, but IBM (https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/apsg/src/tpc/db2z_sqlscalarfn.html) seems to use the term for scalar UDFs.

## Scalar Functions
(to be filled soon)

Scalar functions are functions that return a single value per row, as opposed to aggregation functions, which return a value for a group of rows. Spark SQL supports a variety of built-in scalar functions such as [Array Functions](sql-ref-functions-builtin.html#array-functions), [Map Functions](sql-ref-functions-builtin.html#map-functions), [Date and Timestamp Functions](sql-ref-functions-builtin.html#date-and-timestamp-functions), etc. It also supports [User Defined Scalar Functions](sql-ref-functions-udf-scalar.html).
Member

Any reason not to describe aggregate/window functions? Also, I think it's better not to list each function group name here, because we might forget to update this page every time we add a new group.

Contributor Author

I used "scalar function" instead of "built-in function" because the next paragraph is about aggregations. I don't want to talk about window functions because I think a window function is still a sort of aggregation, even though it returns a result for each row.
I can't link to the built-in functions page as a whole because it also contains aggregate functions, so I link to each group of built-in scalar functions.

Member

I'm a bit confused; the next section describes the DataFrame aggregates in SQL documents?

Contributor Author

Seems the next paragraph talks about builtin aggregation functions and user defined aggregation functions, so I guess it's better for our paragraph to talk about scalar function than built-in function?

Member

Probably, we need to organize both paragraphs more. For example, how about this?

Adds two categories for built-in functions in sql-ref-functions.html#built-in-functions;

### Built-in Functions
Spark SQL has some ...
 
####  Scalar Functions 
Array Functions
Map Functions
Date and Timestamp Functions
JSON Functions

#### Aggregate-like Functions
Aggregate Functions
Window Functions

Then,

## Scalar Functions
A short description about scalar built-in functions and UDFs, then we add two links to their scalar sections.

## Aggregate Functions
A short description about aggregate built-in functions and UDFs, then we add two links to their aggregate sections.

Anyway, I think both categories should be described at the same granularity.

Contributor Author

OK... Seems Window functions still should be put in the Scalar function category. I will double check tomorrow.

@SparkQA

SparkQA commented Apr 22, 2020

Test build #121607 has finished for PR 28290 at commit e5e69e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 23, 2020

Test build #121656 has finished for PR 28290 at commit c831975.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Apr 23, 2020

retest this please

* [Map Functions](sql-ref-functions-builtin.html#map-functions)
* [Date and Timestamp Functions](sql-ref-functions-builtin.html#date-and-timestamp-functions)
* [JSON Functions](sql-ref-functions-builtin.html#json-functions)
* [Window Functions](sql-ref-functions-builtin.html#window-functions)
Member

> OK... Seems Window functions still should be put in the Scalar function category. I will double check tomorrow.

Why do you think so? Putting this in the scalar category still looks weird to me.
The other parts look ok. Also, could you file a jira, just in case?

Contributor Author

I found this https://issues.apache.org/jira/browse/SPARK-29458. I will use this jira.

Contributor Author

By definition, a scalar function returns a single value per row, while an aggregate function returns a single value for a group of rows. Since a window function returns a value per row, it seems more suitable to put it in the scalar function category.
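
To make the three kinds of function in this thread concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for Spark SQL so that it runs without a cluster; the `sales` table and its rows are made up for illustration, and window functions require SQLite 3.25 or newer. It shows why window functions are hard to classify: they return one value per row, like a scalar function, but compute it over a group of rows, like an aggregate.

```python
import sqlite3

# Hypothetical table used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, dept TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "a", 10), (2, "a", 30), (3, "b", 5)],
)

# Scalar function: one output value per input row, computed from that row only.
scalar = conn.execute(
    "SELECT dept, abs(amount - 20) FROM sales ORDER BY id"
).fetchall()

# Aggregate function: one output value per group of rows.
aggregate = conn.execute(
    "SELECT dept, sum(amount) FROM sales GROUP BY dept ORDER BY dept"
).fetchall()

# Window function: one output value per input row, computed over the
# rows of that row's partition.
windowed = conn.execute(
    "SELECT dept, amount, sum(amount) OVER (PARTITION BY dept) "
    "FROM sales ORDER BY id"
).fetchall()

print(scalar)     # three rows in, three rows out
print(aggregate)  # three rows in, one row per dept out
print(windowed)   # three rows in, three rows out, each carrying its dept total
conn.close()
```

The windowed query keeps every input row, which matches the "value per row" argument above, but each value depends on the whole partition, which matches the "aggregate-like" view.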

Member

I think window functions have vector input & output rather than scalar input & output. WDYT? @HyukjinKwon

Contributor Author

This is how Snowflake categorizes the functions: https://docs.snowflake.com/en/sql-reference-functions.html.
Snowflake lists window function in parallel with scalar function and aggregate function, but it also says "Window Functions — subset of aggregate functions that can operate on a subset of rows."

I guess it's safe to use your proposal

#### Aggregate-like Functions
Aggregate Functions
Window Functions

Member

Yea, the last one looks fine. I think we can improve the structure later if necessary.

Contributor Author

Sorry, I am a little confused... you prefer to put window function in scalar function, or aggregate-like function?

Member

aggregate-like function one.

Contributor Author

OK... let me change it now

@SparkQA

SparkQA commented Apr 23, 2020

Test build #121663 has finished for PR 28290 at commit c831975.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao huaxingao changed the title from "[MINOR][SQL][DOCS] Add a paragraph for scalar function in sql getting started" to "[SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started" Apr 23, 2020
@SparkQA

SparkQA commented Apr 24, 2020

Test build #121706 has finished for PR 28290 at commit fddc802.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Apr 24, 2020

btw, I just noticed that we don't have any statement about the LIKE clause. Probably, we need to document it in sql-ref-syntax-qry-select-where.md. Are you planning to do that?

$ ls sql-ref* | grep LIKE
<returns nothing>

@huaxingao
Contributor Author

I will add LIKE clause.

quick sync-up:
I will delete

  • sql-ref-syntax-aggregation.md (CUBE/ROLLUP/GROUPING are already covered in sql-ref-syntax-qry-select-groupby).
  • sql-ref-syntax-qry-select-distinct.md (we have DISTINCT in sql-ref-syntax-qry-select).

Things remaining:

  • LIKE clause
  • a general description page (sql-ref.md), add links to each section and possibly more descriptions
  • remove the 2 leading spaces in sql tables in all the examples
  • a final check and clean up.

I will leave the sub-query one to Dilip for 3.1.0.

@maropu
Member

maropu commented Apr 24, 2020

Looks nice, thanks! cc: @dilipbiswal @gatorsmile

@maropu
Member

maropu commented Apr 25, 2020

I roughly checked if the keywords are included in the documents: https://gist.github.com/maropu/d48733b7683dbb608452ee38833e2dab

It might be worth adding the items below to the docs, too:

  • AFTER
  • CASE/ELSE
  • WHEN/THEN
  • IGNORE NULLS
  • LATERAL VIEW (OUTER)?
  • MAP KEYS TERMINATED BY
  • NULL DEFINED AS
  • LINES TERMINATED BY
  • ESCAPED BY
  • COLLECTION ITEMS TERMINATED BY
  • EXPLAIN LOGICAL
  • PIVOT
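
A keyword check like the one described above could be sketched as follows. This is not the script from the linked gist; the function name `missing_keywords`, the `sql-ref*.md` glob, and the demo file contents are assumptions for illustration only. It searches file contents (not file names) case-insensitively and reports the keywords that appear in none of the files.

```python
import re
import tempfile
from pathlib import Path


def missing_keywords(doc_dir, keywords, pattern="sql-ref*.md"):
    """Return the keywords that appear in none of the files matching pattern."""
    text = "\n".join(
        path.read_text(encoding="utf-8")
        for path in sorted(Path(doc_dir).glob(pattern))
    )
    return [
        kw
        for kw in keywords
        # Whole-word, case-insensitive match on the combined file contents.
        if not re.search(r"\b" + re.escape(kw) + r"\b", text, flags=re.IGNORECASE)
    ]


# Demo on a throwaway directory; the file contents here are made up.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sql-ref-syntax-qry-select-where.md").write_text(
        "WHERE boolean_expression\n", encoding="utf-8"
    )
    Path(tmp, "sql-ref-syntax-qry-select-groupby.md").write_text(
        "GROUP BY ... WITH CUBE | ROLLUP\n", encoding="utf-8"
    )
    result = missing_keywords(tmp, ["LIKE", "PIVOT", "ROLLUP"])

print(result)
```

A plain substring or whole-word match only shows that a keyword is mentioned somewhere, not that it is properly documented, so the output is a starting list for manual review.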

@srowen
Member

srowen commented Apr 28, 2020

Is this much OK as-is or does it need more documentation?

@huaxingao
Contributor Author

@srowen This PR is ready to be merged. @maropu was suggesting that we document more keywords later on.

@srowen srowen closed this in dcc0902 Apr 28, 2020
srowen pushed a commit that referenced this pull request Apr 28, 2020
…etting started

### What changes were proposed in this pull request?
Add a paragraph for scalar function in sql getting started

### Why are the changes needed?
To make 3.0 doc complete.

### Does this PR introduce any user-facing change?
before:
<img width="870" alt="Screen Shot 2020-04-21 at 10 11 12 PM" src="https://user-images.githubusercontent.com/13592258/79943182-16d1fd00-841d-11ea-9744-9cdd58d83f81.png">

after:
<img width="865" alt="Screen Shot 2020-04-22 at 11 49 59 PM" src="https://user-images.githubusercontent.com/13592258/80068256-26704500-84f4-11ea-9845-c835927c027e.png">

<img width="1033" alt="Screen Shot 2020-04-23 at 6 22 53 PM" src="https://user-images.githubusercontent.com/13592258/80165100-82d47280-858f-11ea-8c84-1ef702cc1bff.png">

### How was this patch tested?

Closes #28290 from huaxingao/scalar.

Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit dcc0902)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Member

srowen commented Apr 28, 2020

Merged to master/3.0

@huaxingao
Contributor Author

Thanks! @maropu @srowen

@huaxingao huaxingao deleted the scalar branch April 28, 2020 16:26
@maropu
Member

maropu commented May 19, 2020

note: I've filed a JIRA so that we don't forget to add them to the documents: https://issues.apache.org/jira/browse/SPARK-31753
