@MaxGekk
Member

@MaxGekk MaxGekk commented Apr 27, 2018

What changes were proposed in this pull request?

When reading CSV or JSON files, DataFrameReader's options are converted to Hadoop parameters, for example here:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L302

However, the options are not propagated to the Text datasource during schema inference, for instance:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L184-L188

This PR proposes propagating the user's options to the Text datasource during schema inference, in the same way that the user's options are converted to Hadoop parameters when a schema is specified.
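
The idea can be illustrated with a minimal, Spark-free sketch. It assumes that propagation amounts to overlaying the reader's options on a base configuration. The names `OptionPropagationSketch`, `overlayOptions`, and `baseConf` are hypothetical, and a plain `Map` stands in for a Hadoop `Configuration`; this is not Spark's actual code.

```scala
object OptionPropagationSketch {
  // Hypothetical stand-in for a Hadoop Configuration: an immutable key/value map.
  type Conf = Map[String, String]

  // Overlay the user's reader options on top of the base configuration.
  // User-supplied keys win over base keys; null values are skipped.
  def overlayOptions(baseConf: Conf, userOptions: Map[String, String]): Conf =
    baseConf ++ userOptions.filter { case (_, value) => value != null }

  def main(args: Array[String]): Unit = {
    val baseConf = Map("io.file.buffer.size" -> "4096")
    val userOptions = Map(
      "io.compression.codecs" -> "com.hadoop.compression.lzo.LzopCodec",
      "sep" -> "|")
    val conf = overlayOptions(baseConf, userOptions)
    // The codec setting supplied by the user is now part of the effective configuration.
    println(conf("io.compression.codecs"))
  }
}
```

If the same overlay is applied on both code paths, a codec option set on the reader is available whether the schema is given explicitly or inferred.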

How was this patch tested?

The changes were tested manually by using https://github.com/twitter/hadoop-lzo:

hadoop-lzo> mvn clean package
hadoop-lzo> ln -s ./target/hadoop-lzo-0.4.21-SNAPSHOT.jar ./hadoop-lzo.jar

Create 2 test files in JSON and CSV format and compress them:

$ cat test.csv
col1|col2
a|1
$ lzop test.csv
$ cat test.json
{"col1":"a","col2":1}
$ lzop test.json

Run spark-shell with hadoop-lzo:

bin/spark-shell --jars ~/hadoop-lzo/hadoop-lzo.jar

Read the compressed CSV and JSON files without specifying a schema:

spark.read.option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").option("inferSchema",true).option("header",true).option("sep","|").csv("test.csv.lzo").show()
+----+----+
|col1|col2|
+----+----+
|   a|   1|
+----+----+
spark.read.option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").option("multiLine", true).json("test.json.lzo").printSchema
root
 |-- col1: string (nullable = true)
 |-- col2: long (nullable = true)

@SparkQA

SparkQA commented Apr 27, 2018

Test build #89929 has finished for PR 21182 at commit 8a8ff3f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Apr 27, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Apr 28, 2018

Test build #89942 has finished for PR 21182 at commit 8a8ff3f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 28, 2018

Test build #89953 has finished for PR 21182 at commit 1fa871d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 2, 2018

Test build #90010 has finished for PR 21182 at commit 9f55aa8.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented May 2, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented May 2, 2018

Test build #90047 has finished for PR 21182 at commit 9f55aa8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

cc @gengliangwang

factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, allowUnquotedControlChars)
}

val textOptions = ListMap(parameters.toList: _*)
Member

I think we can use parameters instead of constructing a new map.

Member Author

@MaxGekk MaxGekk May 3, 2018

I constructed a new map because, in one test suite, I got an exception saying that the textOptions value is not serializable. ListMap extends Serializable, which eliminates such exceptions.

Member

I see. CaseInsensitiveMap is declared with Serializable, so I thought it was OK. I didn't try it or run the tests.

Member Author

I ran into the problem in FileStreamSourceSuite, in the test "read new files in partitioned table with globbing, should not read partition data". Let me check it again.

Member Author

The test fails on serialization of the textOptions value because textOptions refers to the transient parameters. I added the @transient annotation to textOptions (it should be safe because textOptions is used on the driver only and shouldn't be serialized).
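
The general behavior behind this fix can be shown with a small, self-contained sketch of Java serialization in Scala. All names here (`TransientSketch`, `DriverOnlyParams`, `EagerOptions`, `TransientOptions`, `roundTrip`) are hypothetical and only model the situation; they are not the Spark classes under review.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, NotSerializableException}
import java.io.{ObjectInputStream, ObjectOutputStream}

object TransientSketch {
  // Hypothetical driver-only holder that is deliberately NOT Serializable.
  final class DriverOnlyParams(val values: Map[String, String])

  // Keeps a plain reference to the holder, so Java serialization fails.
  final class EagerOptions(params: DriverOnlyParams) extends Serializable {
    val textOptions: DriverOnlyParams = params
  }

  // Marks the reference @transient, so the field is skipped during serialization.
  final class TransientOptions(params: DriverOnlyParams) extends Serializable {
    val sep: String = params.values.getOrElse("sep", ",")
    @transient val textOptions: DriverOnlyParams = params
  }

  // Serialize a value to bytes and read it back with plain Java serialization.
  def roundTrip[T <: AnyRef](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val params = new DriverOnlyParams(Map("sep" -> "|"))
    val eagerFails =
      try { roundTrip(new EagerOptions(params)); false }
      catch { case _: NotSerializableException => true }
    println(s"non-transient field fails to serialize: $eagerFails")

    val copy = roundTrip(new TransientOptions(params))
    // Ordinary fields survive; the transient field is not restored.
    println(s"sep = ${copy.sep}, textOptions = ${copy.textOptions}")
  }
}
```

A transient field is not restored on the receiving side, so this approach is only safe when the field is never read after deserialization, which matches the "driver only" reasoning above.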

@SparkQA

SparkQA commented May 5, 2018

Test build #90247 has finished for PR 21182 at commit 49012a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, allowUnquotedControlChars)
}

@transient val textOptions = parameters
Member

If you want to access parameters like this, why not just remove private on it?

Member Author

My motivation for creating a separate variable for the text parameters was to have one place where the text parameters are formed. That way, if we need to pass additional options to the Text datasource in the future, we won't need to modify every place where they are formed. @viirya If you believe it over-complicates the implementation, I will remove it.

Member

It seems unlikely to me that we will have new options that are not passed through parameters. If you think that could happen, I'm fine with this.

Member

Shall we make this less complicated for now and do what @MaxGekk said next time, when it is actually needed?
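
The trade-off discussed in this thread can be sketched as follows. `TextOptionsSketch`, `ParserOptions`, `optionsForInference`, and `optionsForScan` are hypothetical names, and a plain `Map` stands in for Spark's option classes; the sketch only shows why a single `textOptions` member gives one place to change.

```scala
object TextOptionsSketch {
  // Hypothetical options class; `parameters` holds the user's reader options.
  final class ParserOptions(parameters: Map[String, String]) {
    val sep: String = parameters.getOrElse("sep", ",")

    // The only place that decides what is forwarded to the text source.
    // Here it forwards everything; extra entries could be added later
    // without touching the call sites below.
    val textOptions: Map[String, String] = parameters
  }

  // Two hypothetical call sites that forward the same options.
  def optionsForInference(options: ParserOptions): Map[String, String] = options.textOptions
  def optionsForScan(options: ParserOptions): Map[String, String] = options.textOptions

  def main(args: Array[String]): Unit = {
    val options = new ParserOptions(Map(
      "sep" -> "|",
      "io.compression.codecs" -> "com.hadoop.compression.lzo.LzopCodec"))
    // Both call sites see identical options because they share one source.
    println(optionsForInference(options) == optionsForScan(options))
  }
}
```

The alternative raised by the reviewers, exposing `parameters` directly, is simpler today but would require editing each call site if the forwarded options ever diverge from the user's parameters.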

settings
}

val textOptions = ListMap(parameters.toList: _*)
Member

Follow JSONOptions to make it @transient and use parameters instead of creating a new map?

@HyukjinKwon
Member

Seems fine otherwise.

import java.nio.charset.StandardCharsets
import java.util.{Locale, TimeZone}

import scala.collection.immutable.ListMap
Member

nit: seems not used.

@SparkQA

SparkQA commented May 6, 2018

Test build #90270 has finished for PR 21182 at commit d138c44.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 6, 2018

Test build #90269 has finished for PR 21182 at commit cd1ffad.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented May 6, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented May 6, 2018

Test build #90276 has finished for PR 21182 at commit 6610826.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 6, 2018

Test build #90284 has finished for PR 21182 at commit 6610826.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented May 6, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented May 6, 2018

Test build #90286 has finished for PR 21182 at commit 6610826.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented May 8, 2018

@HyukjinKwon @viirya @gengliangwang May I ask you to look at the PR again?

Member

@gengliangwang gengliangwang left a comment

LGTM

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in e3de6ab May 9, 2018
@cloud-fan
Contributor

Shall we backport it to 2.3?

@HyukjinKwon
Member

HyukjinKwon commented May 9, 2018

I tried, but there was a conflict when I merged this. @MaxGekk, mind opening a backport PR?

@MaxGekk
Member Author

MaxGekk commented May 9, 2018

@HyukjinKwon Sure, I will prepare a PR

@MaxGekk
Member Author

MaxGekk commented May 10, 2018

Here is the backport to 2.3: #21292

@MaxGekk MaxGekk deleted the text-options branch August 17, 2019 13:34