[SPARK-19918][SQL] Use TextFileFormat in implementation of TextInputJsonDataSource #17255
Conversation
Force-pushed from 5e90d04 to e2d34b8
(let me wait for the tests before cc'ing someone)
Test build #74375 has finished for PR 17255 at commit
cc @cloud-fan, @JoshRosen and @NathanHowell, could you take a look and see if it makes sense when you have some time?
Test build #74402 has finished for PR 17255 at commit
Is the Encoders import still necessary?
Would there be any additional benefit of replacing more (or all?) of the uses of …
Let me take a look at that. I think we can replace …
Test build #74415 has finished for PR 17255 at commit
df.queryExecution.toRdd
can we make it return Dataset like CSV?
Let me try.
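(For context, a minimal sketch of the two shapes being discussed; `readText` and the exact signatures here are illustrative, not the actual PR code.)

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Sketch only: returning a typed Dataset[String], as the CSV source does,
// keeps the data in the Dataset API so later operations (e.g. sampling)
// can benefit from the SQL engine.
def readText(spark: SparkSession, paths: Seq[String]): Dataset[String] =
  spark.read.textFile(paths: _*)

// The alternative drops to the low-level representation right away:
//   val rdd = df.queryExecution.toRdd   // RDD[InternalRow]
```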
Test build #74426 has started for PR 17255 at commit
This might be too much. I am willing to revert it if anyone feels it is a bit odd.
I made this only to match CSVUtils, which contains variants of logically the same preprocessing performed on different data types (e.g. Iterator, RDD, Dataset).
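(A rough sketch of what such variants could look like; the object name and exact signatures below are hypothetical, mirroring the CSVUtils style rather than the actual PR code.)

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset

// Hypothetical helper: the same sampling logic exposed over different container types.
object JsonSamplingSketch {
  // Dataset variant, used by the line-based (text) path.
  def sample(json: Dataset[String], ratio: Double): Dataset[String] =
    if (ratio == 1.0) json
    else json.sample(withReplacement = false, fraction = ratio, seed = 1L)

  // RDD variant, used by the whole-file path where the input is not a Dataset.
  def sample[T](json: RDD[T], ratio: Double): RDD[T] =
    if (ratio == 1.0) json
    else json.sample(withReplacement = false, fraction = ratio, seed = 1L)
}
```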
Test build #74427 has started for PR 17255 at commit
The changes in this file basically resemble CSVDataSource. (Note that this will be almost identical once #17256 is merged.)
Test build #74429 has started for PR 17255 at commit
It seems all the builds failed unexpectedly. Let me restart.
Force-pushed from 3e6138a to 76dceb2
Test build #74434 has finished for PR 17255 at commit
```scala
  json
} else {
  json.sample(withReplacement = false, configOptions.samplingRatio, 1)
}
```
why move the sample logic out?
Because JsonInferSchema.infer takes an RDD[T] as the actual source from which JSON strings are parsed. In the whole-file case it is RDD[PortableDataStream], whereas in the normal case it is RDD[UTF8String].
The thing is, there seems to be an advantage in doing the sample operation on Dataset[String] rather than on the RDD. So, the sample had to be applied to the Dataset[String] before converting it into RDD[UTF8String].
In a simple view:
- TextInputJsonDataSource:

  ```scala
  val json: Dataset[String] = ...
  val sampled: Dataset[String] = JsonUtils.sample(...)
  val rdd: RDD[UTF8String] = ...
  JsonInferSchema.infer(rdd)
  ```

- WholeFileJsonDataSource:

  ```scala
  val json: RDD[PortableDataStream] = ...
  val sampled: RDD[PortableDataStream] = JsonUtils.sample(...)
  JsonInferSchema.infer(sampled)
  ```

I could not find a good way to generalize JsonInferSchema.infer to take both a Dataset and an RDD as the source, so I kept the logic here with small and clean changes.
If the question is about why it uses Dataset.sample instead of RDD.sample, that was suggested in #17255 (comment).
To my knowledge, both use the same sampler, BernoulliCellSampler, since replacement is disabled, but the Dataset one benefits from code generation. So, I thought there might be a slight benefit.
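(To make the ordering concrete, here is a rough sketch of the Dataset-side flow, assuming JsonInferSchema.infer consumes an RDD[UTF8String]; the helper name and exact calls are simplified, not the verbatim PR code.)

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset
import org.apache.spark.unsafe.types.UTF8String

// Sketch: sample while still a Dataset[String] (so codegen applies),
// then drop to the internal-row RDD that schema inference consumes.
def toInferenceInput(json: Dataset[String], samplingRatio: Double): RDD[UTF8String] = {
  val sampled =
    if (samplingRatio == 1.0) json
    else json.sample(withReplacement = false, fraction = samplingRatio, seed = 1L)
  // Each internal row carries a single string column holding one JSON line.
  sampled.queryExecution.toRdd.map(_.getUTF8String(0))
}
```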
Strictly speaking, maybe this is not directly related to the JIRA. I am willing to revert this change, or please let me know if you have a better idea.
Yah, perhaps the RDD->Dataset changes should be done under a separate issue. I think it can be done across the board (removing most/all RDD references) but I'm not sure what other implications it would have.
(My worry, though, is that it might be grunt work to check whether each one really is better as a Dataset.) @cloud-fan, if you are not sure about this, let me revert. Please let me know.
I think it's fine
thanks, merging to master!
…aframe read / write API

### What changes were proposed in this pull request?

This PR is a retry of #47328 which replaces RDD to Dataset to write SparkR metadata plus this PR removes `repartition(1)`. We actually don't need this when the input is single row as it creates only single partition:

https://github.com/apache/spark/blob/e5e751b98f9ef5b8640079c07a9a342ef471d75d/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala#L49-L57

### Why are the changes needed?

In order to leverage Catalyst optimizer and SQL engine. For example, now we leverage UTF-8 encoding instead of plain JDK ser/de for strings. We have made similar changes in the past, e.g., #29063, #15813, #17255 and SPARK-19918.

Also, we remove `repartition(1)` to avoid unnecessary shuffle.

With `repartition(1)`:

```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=6]
   +- LocalTableScan [_1#0]
```

Without `repartition(1)`:

```
== Physical Plan ==
LocalTableScan [_1#2]
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI in this PR should verify the change

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47341 from HyukjinKwon/SPARK-48883-followup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
This PR proposes to use the text datasource when inferring the JSON schema.

This basically proposes a similar approach to the one in #15813. If we use Dataset for the initial loading when inferring the schema, there are advantages. Please refer to SPARK-18362.

It seems the JSON one was supposed to be fixed together but was taken out, according to #15813.

Also, this seems to affect some functionality because it does not use FileScanRDD. This problem is described in SPARK-19885 (although that was the CSV case).

How was this patch tested?
Existing tests should cover this, plus a manual test with spark.read.json(path) while checking the UI.
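As a rough illustration of the idea at the public-API level (not the internal code paths this PR touches), reading the input through the text source gives a Dataset[String] that the JSON reader can then infer a schema from; the path below is a placeholder.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object TextBasedJsonInferenceExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("text-based-json-inference-sketch")
      .master("local[*]")
      .getOrCreate()

    // Load the raw JSON lines through the text source (one record per line).
    val lines: Dataset[String] = spark.read.textFile("/tmp/example.json")

    // Schema inference then runs over those lines.
    val df = spark.read.json(lines)
    df.printSchema()

    spark.stop()
  }
}
```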