@MaxGekk
Member

@MaxGekk MaxGekk commented Oct 7, 2018

What changes were proposed in this pull request?

In this PR, I propose adding a new function, schema_of_csv(), which infers the schema of a CSV string literal. The result of the function is a string containing the schema in DDL format. For example:

select schema_of_csv('1|abc', map('delimiter', '|'))
struct<_c0:int,_c1:string>
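
As a rough illustration of what the function does, here is a minimal Python sketch that infers a DDL schema string from a single CSV line. It is not Spark's inference code: the names infer_field_type and schema_of_csv_sketch are hypothetical, only int, double and string are recognised, and quoting and escaping are ignored.

```python
def infer_field_type(value):
    """Return a DDL type name for one CSV field (toy rules: int, double, string)."""
    for type_name, parser in (("int", int), ("double", float)):
        try:
            parser(value)
            return type_name
        except ValueError:
            continue
    return "string"


def schema_of_csv_sketch(line, options=None):
    """Build a struct<...> DDL string from one example CSV line.

    Assumption: only the 'delimiter' option is honoured, defaulting to a comma.
    """
    delimiter = (options or {}).get("delimiter", ",")
    fields = line.split(delimiter)
    # Columns are named _c0, _c1, ... as in the example above.
    columns = [f"_c{i}:{infer_field_type(v)}" for i, v in enumerate(fields)]
    return "struct<" + ",".join(columns) + ">"


print(schema_of_csv_sketch("1|abc", {"delimiter": "|"}))
```

For the example input above, the sketch produces the same string as the function's documented result, struct<_c0:int,_c1:string>.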

How was this patch tested?

Added new tests to CsvFunctionsSuite and CsvExpressionsSuite, and SQL tests to csv-functions.sql.

@SparkQA

SparkQA commented Oct 8, 2018

Test build #97091 has finished for PR 22666 at commit 5fb17fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Oct 10, 2018

@HyukjinKwon May I ask you to look at this PR.

@MaxGekk
Member Author

MaxGekk commented Oct 11, 2018

@gatorsmile @cloud-fan May I ask you to look at the PR, please.

@HyukjinKwon
Member

Let's add from_csv first.

@MaxGekk
Member Author

MaxGekk commented Oct 12, 2018

> Let's add from_csv first.

Sure, I just wanted to get it ready, since the changes don't overlap much.

@SparkQA

SparkQA commented Oct 12, 2018

Test build #97314 has finished for PR 22666 at commit c038aaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 12, 2018

Test build #97318 has finished for PR 22666 at commit 0c5e955.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 13, 2018

Test build #97335 has finished for PR 22666 at commit 28862a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Woah .. let me resolve the conflicts tonight.

@HyukjinKwon HyukjinKwon force-pushed the schema_of_csv-function branch from 28862a5 to cd7e2ab Compare October 19, 2018 03:42
@SparkQA

SparkQA commented Oct 19, 2018

Test build #97582 has finished for PR 22666 at commit cd7e2ab.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97583 has finished for PR 22666 at commit 80d6759.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Oct 19, 2018

This is a WIP - there are a hell of a lot of conflicts. Let me resolve them, then review and fix some issues.

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97584 has finished for PR 22666 at commit 4869b76.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97585 has finished for PR 22666 at commit c9df3ab.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class UnivocityParserSuite extends SparkFunSuite

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97586 has finished for PR 22666 at commit 6cbc7fb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97588 has finished for PR 22666 at commit 1b86834.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97595 has finished for PR 22666 at commit 8763494.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97593 has finished for PR 22666 at commit aead783.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97594 has finished for PR 22666 at commit 1e90261.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Should be ready for a look now. Would you mind taking a look please @cloud-fan and @gatorsmile?

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97607 has finished for PR 22666 at commit 41c39db.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

why do we need asInstanceOf?

Contributor

the main constructor of SchemaOfCsv accepts Map[String, String] directly, shall we use that?

Contributor

shall we have an API with scala Map?

Member

schema_of_json also has only the Java-specific variant (I actually suggested minimising the exposed functions), since the Java-specific one can be used from the Scala side, but the Scala-specific one can't be used from the Java side.

@SparkQA

SparkQA commented Oct 20, 2018

Test build #97636 has finished for PR 22666 at commit bd79d87.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon force-pushed the schema_of_csv-function branch from bd79d87 to 49bac0e Compare October 20, 2018 15:56
@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Oct 20, 2018

Test build #97654 has finished for PR 22666 at commit 49bac0e.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon force-pushed the schema_of_csv-function branch from 3ef2503 to 3aa79d4 Compare October 27, 2018 02:15
@SparkQA

SparkQA commented Oct 27, 2018

Test build #98112 has finished for PR 22666 at commit 3aa79d4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Oct 27, 2018

Test build #98118 has finished for PR 22666 at commit 3aa79d4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Oct 27, 2018

Test build #98125 has finished for PR 22666 at commit 3aa79d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

def evalTypeExpr(exp: Expression): DataType = exp match {
  case Literal(s, StringType) => DataType.fromDDL(s.toString)
Contributor

how about

if (expr.isFoldable && expr.dataType == StringType) {
  DataType.fromDDL(expr.eval().asInstanceOf[UTF8String].toString)
}

Contributor

we also need to update https://github.com/apache/spark/pull/22666/files#diff-5321c01e95bffc4413c5f3457696213eR157

in case the constant folding rule is disabled.

Member

Yup, I initially thought that we should allow constant-foldable expressions as well, but decided to follow the initial intent: literal-only support. I also wasn't sure when we would need constant folding to construct a JSON example, because I suspected that it's usually copied and pasted from, for instance, a file.

Member Author

For example, a column with CSV strings may be the result of string functions. So, you could just invoke those functions with particular inputs. Currently, we force people to materialize an example and copy-paste it into schema_of_csv(). That could cause maintainability issues, because users have to keep the example in schema_of_csv() in sync with the code that forms the CSV column.

I prepared PR #27777 to remove this restriction, which is not necessary from my point of view.
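
The difference between literal-only and foldable support discussed in this thread can be sketched with a toy expression tree in Python. The classes Literal, Concat and ColumnRef and both helper functions are hypothetical stand-ins, not Catalyst's API; they only model the idea that a foldable expression can be evaluated without any input row.

```python
from dataclasses import dataclass


@dataclass
class Literal:
    value: str
    foldable = True  # a literal is always a constant

    def eval(self):
        return self.value


@dataclass
class Concat:
    children: tuple

    @property
    def foldable(self):
        # Foldable only if every child can be evaluated without input rows.
        return all(child.foldable for child in self.children)

    def eval(self):
        return "".join(child.eval() for child in self.children)


@dataclass
class ColumnRef:
    name: str
    foldable = False  # depends on table data, so it is not a constant

    def eval(self):
        raise ValueError(f"column {self.name} needs an input row")


def example_literal_only(expr):
    """Literal-only rule: accept nothing but a plain string literal."""
    if isinstance(expr, Literal):
        return expr.value
    raise TypeError("expected a string literal")


def example_foldable(expr):
    """Relaxed rule: accept any expression that folds to a constant."""
    if expr.foldable:
        return expr.eval()
    raise TypeError("expected a foldable string expression")


built = Concat((Literal("1"), Literal(","), Literal("abc")))
print(example_foldable(built))
```

Under the literal-only rule, `built` is rejected even though its value is known up front; under the relaxed rule it is accepted, while a column reference is rejected by both.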

CREATE TEMPORARY VIEW csvTable(csvField, a) AS SELECT * FROM VALUES ('1,abc', 'a');
SELECT schema_of_csv(csvField) FROM csvTable;
-- Clean up
DROP VIEW IF EXISTS csvTable;
Contributor

actually we don't need to clean up temp views. The golden file test is run with a fresh session.

Member

I see but isn't it still better to explicitly clean tables up?

Contributor

yea we need to clean up tables, as they are permanent.

Actually I'm fine with it, as we clean up temp views in a lot of golden files. We can have another PR to remove these temp view clean-ups.
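
The session-scoping argument can be illustrated with an analogy in SQLite (not Spark), using Python's standard sqlite3 module: a temporary view disappears when its session ends, while a table persists. The function name and the table and view names are made up for this sketch.

```python
import os
import sqlite3
import tempfile


def objects_visible_in_fresh_session(db_path):
    """Create a table and a temp view in one session, then list what a new session sees."""
    first = sqlite3.connect(db_path)
    first.execute("CREATE TABLE csv_table (csv_field TEXT, a TEXT)")
    first.execute("INSERT INTO csv_table VALUES ('1,abc', 'a')")
    # The temp view lives only in this connection's temporary schema.
    first.execute("CREATE TEMP VIEW csv_view AS SELECT * FROM csv_table")
    first.commit()
    first.close()

    # A fresh connection plays the role of a fresh session.
    second = sqlite3.connect(db_path)
    names = set()
    for catalog in ("sqlite_master", "sqlite_temp_master"):
        rows = second.execute(f"SELECT name FROM {catalog}").fetchall()
        names.update(row[0] for row in rows)
    second.close()
    return names


with tempfile.TemporaryDirectory() as tmp_dir:
    print(objects_visible_in_fresh_session(os.path.join(tmp_dir, "demo.db")))
```

In this analogy only the table survives into the second session, which mirrors the point that tables need explicit clean-up while temp views do not.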

@HyukjinKwon
Member

Thanks, @cloud-fan. The change looks good to me from my side. Let me take another look at this and leave a sign-off (which means a sign-off for @MaxGekk's code changes).

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM from my side

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Oct 31, 2018

Test build #98307 has finished for PR 22666 at commit 3aa79d4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Oct 31, 2018

Test build #98313 has finished for PR 22666 at commit 3aa79d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@HyukjinKwon
Member

HyukjinKwon commented Nov 1, 2018

Ahhhhh no, I am sorry @MaxGekk. I mistakenly set myself as the primary author. It showed my email first, and I just copied and pasted it as usual.

=== Pull Request #22666 ===
title	[SPARK-25672][SQL] schema_of_csv() - schema inference from an example
source	MaxGekk/schema_of_csv-function
target	master
url	https://api.github.com/repos/apache/spark/pulls/22666

Proceed with merging pull request #22666? (y/n): y
git fetch apache-github pull/22666/head:PR_TOOL_MERGE_PR_22666
From https://github.com/apache/spark
 * [new ref]                 refs/pull/22666/head -> PR_TOOL_MERGE_PR_22666
git fetch apache master:PR_TOOL_MERGE_PR_22666_MASTER
remote: Counting objects: 303, done.
remote: Compressing objects: 100% (153/153), done.
remote: Total 209 (delta 91), reused 0 (delta 0)
Receiving objects: 100% (209/209), 91.89 KiB | 445.00 KiB/s, done.
Resolving deltas: 100% (91/91), completed with 65 local objects.
From https://git-wip-us.apache.org/repos/asf/spark
 * [new branch]              master     -> PR_TOOL_MERGE_PR_22666_MASTER
   57eddc7182e..c5ef477d2f6  master     -> apache/master
git checkout PR_TOOL_MERGE_PR_22666_MASTER
Switched to branch 'PR_TOOL_MERGE_PR_22666_MASTER'
['git', 'merge', 'PR_TOOL_MERGE_PR_22666', '--squash']
Automatic merge went well; stopped before committing as requested
['git', 'log', 'HEAD..PR_TOOL_MERGE_PR_22666', '--pretty=format:%an <%ae>']
Enter primary author in the format of "name <email>" [hyukjinkwon <[email protected]>]: hyukjinkwon <[email protected]>
['git', 'log', 'HEAD..PR_TOOL_MERGE_PR_22666', '--pretty=format:%h [%an] %s']

Looks like the number of commits affects the name that appears at the prompt: Enter primary author in the format of "name <email>".

@asfgit asfgit closed this in c9667af Nov 1, 2018
@HyukjinKwon
Member

Argh, sorry, it was my mistake.

@MaxGekk
Member Author

MaxGekk commented Nov 1, 2018

@HyukjinKwon Never mind. Thank you for your work on the PR.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

In the PR, I propose to add new function - *schema_of_csv()* which infers schema of CSV string literal. The result of the function is a string containing a schema in DDL format. For example:

```sql
select schema_of_csv('1|abc', map('delimiter', '|'))
```
```
struct<_c0:int,_c1:string>
```

## How was this patch tested?

Added new tests to `CsvFunctionsSuite`, `CsvExpressionsSuite` and SQL tests to `csv-functions.sql`

Closes apache#22666 from MaxGekk/schema_of_csv-function.

Lead-authored-by: hyukjinkwon <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
@MaxGekk MaxGekk deleted the schema_of_csv-function branch August 17, 2019 13:35
4 participants