[SPARK-25672][SQL] schema_of_csv() - schema inference from an example #22666

MaxGekk · 2018-10-07T20:05:41Z

What changes were proposed in this pull request?

In this PR, I propose adding a new function, schema_of_csv(), which infers the schema of a CSV string literal. The result of the function is a string containing the schema in DDL format. For example:

select schema_of_csv('1|abc', map('delimiter', '|'))

struct<_c0:int,_c1:string>
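
To make the example concrete, here is a toy Python sketch of the idea: split one example line on the delimiter, guess a type for each field, and render the result as a DDL struct string. This is not Spark's implementation. The helper names `infer_field_type` and `schema_of_csv_sketch` are hypothetical, and only three types (int, double, string) are considered.

```python
def infer_field_type(value: str) -> str:
    """Guess a DDL type name for a single CSV field (toy rules only)."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"


def schema_of_csv_sketch(example: str, delimiter: str = ",") -> str:
    """Return a DDL-style struct string for one example CSV line."""
    fields = example.split(delimiter)
    # Columns are named _c0, _c1, ... as in the example output above.
    columns = [f"_c{i}:{infer_field_type(v)}" for i, v in enumerate(fields)]
    return "struct<" + ",".join(columns) + ">"


print(schema_of_csv_sketch("1|abc", delimiter="|"))
```

With the input from the SQL example, the sketch prints the same string, struct<_c0:int,_c1:string>. The real function presumably honours many more CSV options than the single delimiter handled here.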

How was this patch tested?

Added new tests to CsvFunctionsSuite, CsvExpressionsSuite and SQL tests to csv-functions.sql

SparkQA · 2018-10-08T00:21:46Z

Test build #97091 has finished for PR 22666 at commit 5fb17fb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2018-10-10T12:06:02Z

@HyukjinKwon May I ask you to look at this PR?

MaxGekk · 2018-10-11T08:39:47Z

@gatorsmile @cloud-fan May I ask you to look at the PR, please?

HyukjinKwon · 2018-10-12T01:28:35Z

Let's add from_csv first.

MaxGekk · 2018-10-12T08:51:59Z

> Let's add from_csv first.

Sure. I just wanted to get it ready, since the changes don't overlap much.

SparkQA · 2018-10-12T21:40:52Z

Test build #97314 has finished for PR 22666 at commit c038aaa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-12T23:34:16Z

Test build #97318 has finished for PR 22666 at commit 0c5e955.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-13T11:36:03Z

Test build #97335 has finished for PR 22666 at commit 28862a5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-10-17T01:35:18Z

Woah .. let me resolve the conflicts tonight.

SparkQA · 2018-10-19T03:45:33Z

Test build #97582 has finished for PR 22666 at commit cd7e2ab.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-19T03:50:10Z

Test build #97583 has finished for PR 22666 at commit 80d6759.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-10-19T03:54:03Z

This is a WIP - there are a hell of a lot of conflicts. Let me resolve them, review, and fix some issues.

SparkQA · 2018-10-19T03:55:16Z

Test build #97584 has finished for PR 22666 at commit 4869b76.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-19T04:00:02Z

Test build #97585 has finished for PR 22666 at commit c9df3ab.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class UnivocityParserSuite extends SparkFunSuite

SparkQA · 2018-10-19T06:48:24Z

Test build #97586 has finished for PR 22666 at commit 6cbc7fb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-19T07:05:01Z

Test build #97588 has finished for PR 22666 at commit 1b86834.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-19T07:05:01Z

Test build #97595 has finished for PR 22666 at commit 8763494.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-19T07:05:01Z

Test build #97593 has finished for PR 22666 at commit aead783.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-19T07:05:02Z

Test build #97594 has finished for PR 22666 at commit 1e90261.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-10-19T09:18:57Z

Should be ready for a look now. Would you mind taking a look please @cloud-fan and @gatorsmile?

SparkQA · 2018-10-19T12:19:36Z

Test build #97607 has finished for PR 22666 at commit 41c39db.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-10-19T13:09:00Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala

why do we need asInstanceOf?

cloud-fan · 2018-10-19T13:10:37Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CsvExpressionsSuite.scala

the main constructor of SchemaOfCsv accepts Map[String, String] directly, shall we use that?

cloud-fan · 2018-10-19T13:11:09Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

shall we have an API with scala Map?

schema_of_json also has only the Java-specific one (I actually suggested minimising the exposed functions), since a Java-specific one can be used from the Scala side, but a Scala-specific one can't be used from the Java side.

SparkQA · 2018-10-20T06:27:17Z

Test build #97636 has finished for PR 22666 at commit bd79d87.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-10-20T18:05:27Z

retest this please

SparkQA · 2018-10-20T19:33:17Z

Test build #97654 has finished for PR 22666 at commit 49bac0e.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-27T03:44:47Z

Test build #98112 has finished for PR 22666 at commit 3aa79d4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-10-27T05:58:55Z

retest this please

SparkQA · 2018-10-27T07:05:02Z

Test build #98118 has finished for PR 22666 at commit 3aa79d4.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-10-27T09:57:28Z

retest this please

SparkQA · 2018-10-27T13:26:44Z

Test build #98125 has finished for PR 22666 at commit 3aa79d4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-10-29T03:01:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala

+  }
+
+  def evalTypeExpr(exp: Expression): DataType = exp match {
+    case Literal(s, StringType) => DataType.fromDDL(s.toString)


how about

if (expr.isFoldable && expr.dataType == StringType) { DataType.fromDDL(expr.eval().asInstanceOf[UTF8String].toString) }

we also need to update https://github.com/apache/spark/pull/22666/files#diff-5321c01e95bffc4413c5f3457696213eR157 in case the constant folding rule is disabled.

Yup, I initially thought we should allow constant-foldable expressions as well, but decided to follow the initial intent: literal-only support. I also wasn't sure when we would need constant folding to construct a JSON example, because I suspected the example is usually copied and pasted from, for instance, a file.

For example, a column with a CSV string may be the result of string functions, so you could just invoke those functions with particular inputs. Currently, we force people to materialize an example and copy-paste it into schema_of_csv(). That could cause maintainability issues, since users have to keep the example in schema_of_csv() in sync with the code that forms the CSV column.

I prepared PR #27777 to remove the restriction, which is unnecessary from my point of view.
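
The difference between the two rules discussed in this thread (literal only versus any foldable expression) can be sketched in plain Python. The classes below are hypothetical stand-ins and not Catalyst's expression classes. They only model the idea that an expression is foldable when it can be evaluated without any input row.

```python
class Literal:
    """A constant string known before any row is processed."""
    foldable = True

    def __init__(self, value):
        self.value = value

    def eval(self, row=None):
        return self.value


class Column:
    """A per-row value; it cannot be evaluated without input data."""
    foldable = False

    def __init__(self, name):
        self.name = name

    def eval(self, row=None):
        return row[self.name]


class Concat:
    """String concatenation; foldable only if every child is foldable."""

    def __init__(self, *children):
        self.children = children

    @property
    def foldable(self):
        return all(child.foldable for child in self.children)

    def eval(self, row=None):
        return "".join(child.eval(row) for child in self.children)


def example_from_literal(expr):
    """Literal-only rule: accept nothing but a bare Literal."""
    if isinstance(expr, Literal):
        return expr.value
    raise TypeError("expected a string literal")


def example_from_foldable(expr):
    """Relaxed rule: accept any expression that can be evaluated up front."""
    if expr.foldable:
        return expr.eval()
    raise TypeError("expected a foldable string expression")


built = Concat(Literal("1"), Literal(","), Literal("abc"))
print(example_from_foldable(built))
```

Under the literal-only rule, `built` is rejected even though its value is fully known in advance; the relaxed rule accepts it and still rejects anything that depends on a `Column`.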

cloud-fan · 2018-10-29T03:09:19Z

sql/core/src/test/resources/sql-tests/inputs/csv-functions.sql

+CREATE TEMPORARY VIEW csvTable(csvField, a) AS SELECT * FROM VALUES ('1,abc', 'a');
+SELECT schema_of_csv(csvField) FROM csvTable;
+-- Clean up
+DROP VIEW IF EXISTS csvTable;


actually we don't need to clean up temp views. The golden file test is run with a fresh session.

I see but isn't it still better to explicitly clean tables up?

yea we need to clean up tables, as they are permanent.

Actually I'm fine with it, as we clean up temp views in a lot of golden files. We can have another PR to remove these temp view cleanups.

HyukjinKwon · 2018-10-30T02:15:20Z

Thanks, @cloud-fan. The change looks good from my side. Let me take another look at this and leave a sign-off (which means a sign-off for @MaxGekk's code changes).

HyukjinKwon

LGTM from my side

HyukjinKwon · 2018-10-31T09:16:32Z

retest this please

SparkQA · 2018-10-31T10:56:18Z

Test build #98307 has finished for PR 22666 at commit 3aa79d4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-10-31T11:08:26Z

retest this please

SparkQA · 2018-10-31T14:37:48Z

Test build #98313 has finished for PR 22666 at commit 3aa79d4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-11-01T01:13:21Z

Merged to master.

HyukjinKwon · 2018-11-01T01:17:48Z

Ahhhhh no, I am sorry @MaxGekk. I mistakenly set myself as the primary author. It showed my email first, and I just copied and pasted it as usual.

=== Pull Request #22666 ===
title	[SPARK-25672][SQL] schema_of_csv() - schema inference from an example
source	MaxGekk/schema_of_csv-function
target	master
url	https://api.github.com/repos/apache/spark/pulls/22666

Proceed with merging pull request #22666? (y/n): y
git fetch apache-github pull/22666/head:PR_TOOL_MERGE_PR_22666
From https://github.com/apache/spark
 * [new ref]                 refs/pull/22666/head -> PR_TOOL_MERGE_PR_22666
git fetch apache master:PR_TOOL_MERGE_PR_22666_MASTER
remote: Counting objects: 303, done.
remote: Compressing objects: 100% (153/153), done.
remote: Total 209 (delta 91), reused 0 (delta 0)
Receiving objects: 100% (209/209), 91.89 KiB | 445.00 KiB/s, done.
Resolving deltas: 100% (91/91), completed with 65 local objects.
From https://git-wip-us.apache.org/repos/asf/spark
 * [new branch]              master     -> PR_TOOL_MERGE_PR_22666_MASTER
   57eddc7182e..c5ef477d2f6  master     -> apache/master
git checkout PR_TOOL_MERGE_PR_22666_MASTER
Switched to branch 'PR_TOOL_MERGE_PR_22666_MASTER'
['git', 'merge', 'PR_TOOL_MERGE_PR_22666', '--squash']
Automatic merge went well; stopped before committing as requested
['git', 'log', 'HEAD..PR_TOOL_MERGE_PR_22666', '--pretty=format:%an <%ae>']
Enter primary author in the format of "name <email>" [hyukjinkwon <[email protected]>]: hyukjinkwon <[email protected]>
['git', 'log', 'HEAD..PR_TOOL_MERGE_PR_22666', '--pretty=format:%h [%an] %s']

Looks like the number of commits affects the name that appears at the prompt: Enter primary author in the format of "name <email>".
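
One possible reading of this remark, offered only as an assumption since the thread does not show the merge script's code, is that the suggested default is the author with the most commits on the branch. A minimal Python sketch of that selection rule, with a hypothetical helper name and made-up authors:

```python
from collections import Counter


def default_primary_author(commit_authors):
    """Pick the author who appears most often in a list of commit authors."""
    counts = Counter(commit_authors)
    author, _ = counts.most_common(1)[0]
    return author


# Hypothetical commit log: one entry per commit, in "name <email>" form.
log = [
    "alice <alice@example.com>",
    "bob <bob@example.com>",
    "bob <bob@example.com>",
    "bob <bob@example.com>",
]
print(default_primary_author(log))
```

Under such a rule, a reviewer who pushes many fix-up commits to a contributor's branch would be offered as the default, so the prompt needs a manual check before it is accepted.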

HyukjinKwon · 2018-11-01T01:20:07Z

Argh, sorry, it was my mistake.

MaxGekk · 2018-11-01T07:25:14Z

@HyukjinKwon Never mind. Thank you for your work on the PR.

## What changes were proposed in this pull request?

In the PR, I propose to add new function - *schema_of_csv()* which infers schema of CSV string literal. The result of the function is a string containing a schema in DDL format. For example:

```sql
select schema_of_csv('1|abc', map('delimiter', '|'))
```

```
struct<_c0:int,_c1:string>
```

## How was this patch tested?

Added new tests to `CsvFunctionsSuite`, `CsvExpressionsSuite` and SQL tests to `csv-functions.sql`

Closes apache#22666 from MaxGekk/schema_of_csv-function.

Lead-authored-by: hyukjinkwon <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>

MaxGekk mentioned this pull request Oct 12, 2018

[SPARK-25393][SQL] Adding new function from_csv() #22379

Closed

HyukjinKwon force-pushed the schema_of_csv-function branch from 28862a5 to cd7e2ab on October 19, 2018 03:42

cloud-fan reviewed Oct 19, 2018

HyukjinKwon force-pushed the schema_of_csv-function branch from bd79d87 to 49bac0e on October 20, 2018 15:56

HyukjinKwon added 8 commits October 27, 2018 09:03

Fix doctest examples to be more uesful

21e2dc4

Deduplicate and fix python examples to be more useful

6b1f408

literals only

e343d4d

Address comments

26fb354

sync tests

b068d9f

match to schema_of_json

4696cdd

updates tests

b8c6c94

Resolve conflicts

3aa79d4

HyukjinKwon force-pushed the schema_of_csv-function branch from 3ef2503 to 3aa79d4 on October 27, 2018 02:15

cloud-fan reviewed Oct 29, 2018

HyukjinKwon approved these changes Oct 31, 2018

cloud-fan approved these changes Oct 31, 2018

asfgit closed this in c9667af Nov 1, 2018

MaxGekk deleted the schema_of_csv-function branch August 17, 2019 13:35
