-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-25672][SQL] schema_of_csv() - schema inference from an example #22666
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #97091 has finished for PR 22666 at commit
|
|
@HyukjinKwon May I ask you to look at this PR. |
|
@gatorsmile @cloud-fan May I ask you to look at the PR, please. |
|
Let's add from_csv first. |
Sure, I just wanted to make it ready since the changes are not overlapped so much. |
|
Test build #97314 has finished for PR 22666 at commit
|
|
Test build #97318 has finished for PR 22666 at commit
|
|
Test build #97335 has finished for PR 22666 at commit
|
|
Woah .. let me resolve the conflicts tonight. |
28862a5 to
cd7e2ab
Compare
|
Test build #97582 has finished for PR 22666 at commit
|
|
Test build #97583 has finished for PR 22666 at commit
|
|
This is a WIP - there are a hell of a lot conflicts. Let me resolve, review and fix some issues. |
|
Test build #97584 has finished for PR 22666 at commit
|
|
Test build #97585 has finished for PR 22666 at commit
|
|
Test build #97586 has finished for PR 22666 at commit
|
|
Test build #97588 has finished for PR 22666 at commit
|
|
Test build #97595 has finished for PR 22666 at commit
|
|
Test build #97593 has finished for PR 22666 at commit
|
|
Test build #97594 has finished for PR 22666 at commit
|
|
Should be ready for a look now. Would you mind taking a look please @cloud-fan and @gatorsmile? |
|
Test build #97607 has finished for PR 22666 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need asInstanceOf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the main constructor of SchemaOfCsv accepts Map[String, String] directly, shall we use that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we have an API with scala Map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
schema_of_json also has only Java specific (I actually suggested to minimise exposed functions) since Java specific one can be used in Scala side but Scala specific can't be used in Java side.
|
Test build #97636 has finished for PR 22666 at commit
|
bd79d87 to
49bac0e
Compare
|
retest this please |
|
Test build #97654 has finished for PR 22666 at commit
|
3ef2503 to
3aa79d4
Compare
|
Test build #98112 has finished for PR 22666 at commit
|
|
retest this please |
|
Test build #98118 has finished for PR 22666 at commit
|
|
retest this please |
|
Test build #98125 has finished for PR 22666 at commit
|
| } | ||
|
|
||
| def evalTypeExpr(exp: Expression): DataType = exp match { | ||
| case Literal(s, StringType) => DataType.fromDDL(s.toString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about
if (expr.isFoldable && expr.dataType == StringType) {
DataType.fromDDL(expr.eval().asInstanceOf[UTF8String].toString)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we also need to update https://github.com/apache/spark/pull/22666/files#diff-5321c01e95bffc4413c5f3457696213eR157
in case the constant folding rule is disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that's what I initially thought that we should allow constant-foldable expressions as well but just decided to follow the initial intent - literal only support. I wasn't also sure about when we would need constant folding to construct a JSON example because I suspected that's usually copied and pasted from, for instance, a file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, a column with CSV string may be a result of string functions. So, you could just invoke the functions with an particular inputs. Currently, we force people to materialize an example and copy-past it to schema_of_csv(). That could cause maintainability issues, so, users should keep in sync the example in schema_of_csv() with the code which forms CSV column.
I prepared the PR #27777 to avoid the restriction which is not necessary from my point of view.
| CREATE TEMPORARY VIEW csvTable(csvField, a) AS SELECT * FROM VALUES ('1,abc', 'a'); | ||
| SELECT schema_of_csv(csvField) FROM csvTable; | ||
| -- Clean up | ||
| DROP VIEW IF EXISTS csvTable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually we don't need to clean up temp views. The golden file test is run with a fresh session.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see but isn't it still better to explicitly clean tables up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea we need to clean up tables, as they are permanent.
Actually I'm fine with it, as we clean up temp views in a lot of golden files. We can have another PR to remove these temp view clean up.
|
Thanks, @cloud-fan. The change looks good to me from my side. Let me take another look for this and leave a sign-off (which means a sign-off for @MaxGekk's code changes) |
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from my side
|
retest this please |
|
Test build #98307 has finished for PR 22666 at commit
|
|
retest this please |
|
Test build #98313 has finished for PR 22666 at commit
|
|
Merged to master. |
|
Ahhhhh no I am sorry @MaxGekk. I made the primary author as me mistakenly. It showed my email first, and I just mistakenly copied and pasted as usual. Looks the number of commits affects the name appearing for |
|
Argh, sorry, it was my mistake. |
|
@HyukjinKwon Never mind. Thank you for your work on the PR. |
## What changes were proposed in this pull request?
In the PR, I propose to add new function - *schema_of_csv()* which infers schema of CSV string literal. The result of the function is a string containing a schema in DDL format. For example:
```sql
select schema_of_csv('1|abc', map('delimiter', '|'))
```
```
struct<_c0:int,_c1:string>
```
## How was this patch tested?
Added new tests to `CsvFunctionsSuite`, `CsvExpressionsSuite` and SQL tests to `csv-functions.sql`
Closes apache#22666 from MaxGekk/schema_of_csv-function.
Lead-authored-by: hyukjinkwon <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
What changes were proposed in this pull request?
In the PR, I propose to add new function - schema_of_csv() which infers schema of CSV string literal. The result of the function is a string containing a schema in DDL format. For example:
How was this patch tested?
Added new tests to
CsvFunctionsSuite,CsvExpressionsSuiteand SQL tests tocsv-functions.sql