
Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Apr 29, 2018

What changes were proposed in this pull request?

I propose a flexible format for the lineSep option used in text-based datasources like JSON. The new format of the option has the following syntax:

lineSep ::= (selector separator-spec) | text-separator
selector ::= 'x' | reserved-selector
reserved-selector ::= '\' | 'r'
separator-spec ::= <valid string literal in Python, R, Java and Scala>
text-separator ::= first-char separator-spec
first-char ::= <any char that is not a selector>

Examples of lineSep in the new format:

x0a.00.00.00 0d.00.00.00
x5445
|^|
\r\n
-
sep
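
As an illustration only, a hypothetical parser for the examples above could be sketched like this (this is not the PR's implementation; the object name and the UTF-8 assumption for plain separators are mine):

```scala
// Hypothetical sketch of decoding the proposed lineSep syntax into separator
// bytes (names and UTF-8 handling assumed; not the PR's actual code).
object LineSepSketch {
  def toBytes(lineSep: String): Array[Byte] = lineSep.head match {
    case 'x' =>
      // Hex selector: "x0a 0d" or "x0a.00.00.00" - drop spaces and dots,
      // then decode each two-char hex pair into one byte
      lineSep.tail
        .filterNot(c => c.isWhitespace || c == '.')
        .grouped(2)
        .map(pair => Integer.parseInt(pair, 16).toByte)
        .toArray
    case '\\' | 'r' =>
      sys.error(s"Selector '${lineSep.head}' is reserved for future use")
    case _ =>
      // Plain text separator such as "|^|" or "sep"
      lineSep.getBytes("UTF-8")
  }
}
```

Per the grammar, a plain text separator may not start with a selector char, which is why a leading '\' or 'r' is rejected here.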

The '\' and 'r' selectors are reserved for future use. For instance, 'r' could be used for regular expressions like r[0-9]+ or r(x1E|x0Ax1E|x0A) for parsing JSON Streaming.

The new format addresses the following use cases:

  1. The hexadecimal format allows specifying lineSep independently of the encoding. This makes it possible to read JSON files with a BOM in per-line mode. See [SPARK-23723] New charset option for json datasource #20849 (comment)

  2. JSON files coming from embedded systems often have non-standard separators (invisible in some cases). It is very convenient to open such a file in a hex editor and copy the bytes between } and { into the lineSep option. This is the use case for the format with the 'x' selector, like: x0d 54 45

  3. In JSON Streaming, records can be separated in quite different ways. We should leave room for improvement, I believe. See the 'r' (for regexps) and '\' reserved selectors

  4. Some UTF-8 chars can cause errors from style (format) checkers. It is easier to represent such chars in hexadecimal format than to disable the checkers.

  5. In the near future, the JSON datasource will support input in different charsets. If the source code is in UTF-8 but the input JSON is in a different charset, it is somewhat awkward to put such chars directly into the lineSep option. The x<hexs> format is more convenient here as well.

How was this patch tested?

The changes are checked by 2 new tests that read JSON files in UTF-16 and UTF-32 with a BOM. Also, 2 new cases are added to an existing test.

@SparkQA

SparkQA commented Apr 29, 2018

Test build #89973 has finished for PR 21192 at commit 4b3592f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2018

Test build #89972 has finished for PR 21192 at commit 60d5828.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Apr 30, 2018

jenkins, retest this, please

* selector (1 char) + separator spec (any length) | sequence of chars
*
* Currently the following selectors are supported:
* - 'x' + sequence of bytes in hexadecimal format. For example: "x0a 0d".
Member

I actually think this one (sequence of bytes) should be supported via the option APIs, not only for lineSep specifically. For example, I proposed array type support in options in #16611. Maybe, we could try a similar approach.

Member Author

Your approach looks interesting, but your PRs #20125 and #16611 are stuck. Are there any specific reasons for that?

Maybe, we could try a similar approach.

We could, but it requires extending the public API more widely than these changes need. Do you think it makes sense? Actually, I would consider lineSep as a string in JSON format like:

.option("lineSep", "[0x00, 0x0A, 0x00, 0x0D]")

or

.option("lineSep", """{"sep": "\r\n", "encoding": "UTF-16LE"}""")

but for regular cases of a simple lineSep, this approach looks pretty redundant:

.option("lineSep", ",") vs .option("lineSep", """{"sep": ","}""")

Member

I don't know why they are stuck. I was just thinking of doing the sequence of bytes separately since I feel the same way about the need, and there's a PR that potentially covers this case.
I am less sure about the other formats yet - I don't feel strongly, but I don't have better alternatives yet.

Member Author

Just for clarification, you will accept only the solution with an array for lineSep, right?

Member

Yes, that's what I thought, and I hope the sequence-of-bytes support alone is taken out here if you don't mind. In that case, I don't want to block this PR either.

Member

BTW, I misunderstood this PR because you had also proposed regex support before.

Member Author

I reserved the r prefix for that. It is hard to support it right now because it requires some support from Hadoop's LineReader.

@SparkQA

SparkQA commented Apr 30, 2018

Test build #89976 has finished for PR 21192 at commit 4b3592f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented May 2, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented May 2, 2018

Test build #90050 has finished for PR 21192 at commit bcc17e0.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented May 17, 2018

eh I actually think separated makes it much simpler to look at, compared with an array. Why complicate the API and require users to understand how to specify an array (in all languages)?

One question I have is whether we'd want to support multiple separators, where each separator can be a multi-char sequence as well. In that case an array might make more sense for specifying the multiple separators, with each separator just space-delimited for chars.

@HyukjinKwon
Member

eh I actually think separated makes it much simpler to look at, compared with an array. Why complicate the API and require users to understand how to specify an array (in all languages)?

Hey @rxin, it's because an array can commonly be given to an option. There are many such cases in CSV at least. What's the actual diff between the proposed change here and an array at the core? An array option API at least provides a safer and cleaner way for users.

@rxin
Contributor

rxin commented May 17, 2018

my point is that I don't consider a sequence of chars an array to begin with. It is not natural to me.

I'd want an array if it is a different set of separators.

@HyukjinKwon
Member

Ah, I misunderstood. I suggested taking out its sequence support here for now anyway.

@cloud-fan
Contributor

It seems good to use the array option for the hexadecimal sequence, if we have the array option feature. But I don't think we should block this PR just because the array option is not ready.

As long as we can finish the array option before Spark 2.4, we can still change the lineSeparator option and use the array option.

@HyukjinKwon
Member

sorry, why don't we do the interface thing first?

@cloud-fan
Contributor

I don't see the necessity of having an order here. We can do the array option interface or this one first.

Using a format like x0a 0d is acceptable here, though it seems better to use the array option.

From an engineering point of view, we should try our best to avoid dependencies between different pieces of work.

If you think x0a 0d is a bad format, then it's fair to block this PR, and we can discuss what's wrong with it.

@HyukjinKwon
Member

Everything here would need to be rewritten if we go the array way, if I'm not mistaken. In that case, I think we had better, in a way, do the array option first, if we are all willing to add it within 2.4.0, to reduce the work and effort. FWIW, I don't feel strongly about being too clever by introducing a custom string format when we have other alternatives.

@cloud-fan
Contributor

Regarding the array option, what's the status? We can provide an API on the Spark side to easily set a Seq/Array in the options, but what about the read side? Do users need to understand the array option string format and parse it themselves?

@HyukjinKwon
Member

HyukjinKwon commented May 21, 2018

It's stuck. I kept pinging and gave up.

Do users need to understand the array option string format and parse it themselves?

Yea, I think this is the reason why it's stuck. I believe there's no easier option, so at that time I was trying to document this. It's at least not a custom format but standard JSON.

@cloud-fan
Contributor

I'd like to hide the format from users (including data source developers). In DataSourceOptions I also use the JSON array format to store paths, and DataSourceOptions#getPaths to retrieve them. Then users don't need to create a JSON parser to read options.

Maybe we can go in that direction?
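
The accessor idea above could be sketched roughly as follows (a toy helper with assumed names, not Spark's actual DataSourceOptions; a bracketed comma-separated encoding stands in for the JSON array format):

```scala
// Toy sketch: hide a list-valued option's wire format behind set/get helpers,
// so callers never parse the encoded string themselves (names are assumed).
class SketchOptions {
  private var underlying = Map.empty[String, String]

  // Encode a sequence of byte values into a single option string.
  def setIntSeq(key: String, values: Seq[Int]): Unit =
    underlying += key -> values.mkString("[", ",", "]")

  // Decode it back; the caller never sees the underlying format.
  def getIntSeq(key: String): Option[Seq[Int]] =
    underlying.get(key).map { s =>
      s.stripPrefix("[").stripSuffix("]").split(",").map(_.trim.toInt).toSeq
    }
}
```

With helpers like these, neither end users nor datasource developers need to know how the sequence is serialized, which is the point of the getPaths-style approach.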

@HyukjinKwon
Member

I thought about it at that time, but I forget why I ended up giving up (probably because the change would become big and I thought it wasn't going to be accepted). Will take another fresh look soon.

@SparkQA

SparkQA commented Jun 15, 2018

Test build #91906 has finished for PR 21192 at commit 9e2157c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 22, 2018

Test build #92184 has finished for PR 21192 at commit eab96b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ForeachBatchFunction(object):
  • case class ArrayDistinct(child: Expression)
  • class PythonForeachWriter(func: PythonFunction, schema: StructType)
  • class UnsafeRowBuffer(taskMemoryManager: TaskMemoryManager, tempDir: File, numFields: Int)
  • trait MemorySinkBase extends BaseStreamingSink with Logging
  • class MemorySink(val schema: StructType, outputMode: OutputMode, options: DataSourceOptions)
  • class ForeachBatchSink[T](batchWriter: (Dataset[T], Long) => Unit, encoder: ExpressionEncoder[T])
  • trait PythonForeachBatchFunction
  • case class ForeachWriterProvider[T](
  • case class ForeachWriterFactory[T](
  • class ForeachDataWriter[T](
  • class MemoryWriter(
  • class MemoryStreamWriter(

…sep2

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
@SparkQA

SparkQA commented Jul 7, 2018

Test build #92707 has finished for PR 21192 at commit 5384c07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Jul 17, 2018

Looks like there is no consensus on this PR. @rxin @cloud-fan @HyukjinKwon Should I close it?

@HyukjinKwon
Member

To me, yes. If you find some time, I would appreciate it if you took a look at #21192 (comment) too, so that we can review each other's work, whoever makes a PR first.
