
Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Apr 29, 2018

What changes were proposed in this pull request?

I propose a flexible format for the lineSep option used in text-based datasources like JSON. The new format of the option has the following syntax:

lineSep ::= (selector separator-spec) | text-separator
selector ::= 'x' | reserved-selector
reserved-selector ::= '\' | 'r'
separator-spec ::= <valid string literal in Python, R, Java and Scala>
text-separator ::= first-char separator-spec
first-char ::= <any char that is not a selector>

Examples of lineSep in the new format:

x0a.00.00.00 0d.00.00.00
x5445
|^|
\r\n
-
sep
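
As an illustration only, a hypothetical parser for the examples above could be sketched like this (this is not the PR's implementation; the object name and the UTF-8 assumption for plain separators are mine):

```scala
// Hypothetical sketch of decoding the proposed lineSep syntax into separator
// bytes (names and UTF-8 handling assumed; not the PR's actual code).
object LineSepSketch {
  def toBytes(lineSep: String): Array[Byte] = lineSep.head match {
    case 'x' =>
      // Hex selector: "x0a 0d" or "x0a.00.00.00" - drop spaces and dots,
      // then decode each two-char hex pair into one byte
      lineSep.tail
        .filterNot(c => c.isWhitespace || c == '.')
        .grouped(2)
        .map(pair => Integer.parseInt(pair, 16).toByte)
        .toArray
    case '\\' | 'r' =>
      sys.error(s"Selector '${lineSep.head}' is reserved for future use")
    case _ =>
      // Plain text separator such as "|^|" or "sep"
      lineSep.getBytes("UTF-8")
  }
}
```

Per the grammar, a plain text separator may not start with a selector char, which is why a leading '\' or 'r' is rejected here.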

The '\' and 'r' selectors are reserved for future use. For instance, 'r' could be used for regular expressions like r[0-9]+ or r(x1E|x0Ax1E|x0A) for parsing JSON Streaming.

The new format addresses the following use cases:

  1. The hexadecimal format allows specifying lineSep independently of the encoding. This makes it possible to read JSON files with a BOM in per-line mode. See [SPARK-23723] New charset option for json datasource #20849 (comment)

  2. JSON files coming from embedded systems often have non-standard separators (invisible in some cases). It is very convenient to open such a file in a hex editor and copy the bytes between } and { into the lineSep option. This is the use case for the format with the 'x' selector, like: x0d 54 45

  3. In JSON Streaming, records can be separated in quite different ways. We should leave room for improvement, I believe. See the 'r' (for regexps) and '\' reserved selectors

  4. Some UTF-8 chars can cause errors from style (format) checkers. It is easier to represent such chars in hexadecimal format than to disable the checkers.

  5. In the near future, the JSON datasource will support input in different charsets. If the source code is in UTF-8 but the input JSON is in a different charset, it is somewhat awkward to put such chars directly into the lineSep option. The x<hexs> format is more convenient here as well.

How was this patch tested?

The changes are checked by 2 new tests that read JSON files in UTF-16 and UTF-32 with a BOM. Also, 2 new cases are added to an existing test.

@SparkQA

SparkQA commented Apr 29, 2018

Test build #89973 has finished for PR 21192 at commit 4b3592f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2018

Test build #89972 has finished for PR 21192 at commit 60d5828.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Apr 30, 2018

jenkins, retest this, please

* selector (1 char) + separator spec (any length) | sequence of chars
*
* Currently the following selectors are supported:
* - 'x' + sequence of bytes in hexadecimal format. For example: "x0a 0d".
Member

I actually think this one (sequence of bytes) should be supported via the option APIs, not only for lineSep specifically. For example, I proposed array type support in options in #16611. Maybe, we could try a similar approach.

Member Author

Your approach looks interesting, but your PRs #20125 and #16611 are stuck. Are there any specific reasons for that?

Maybe, we could try a similar approach.

We could, but it requires extending the public API more widely than these changes need. Do you think it makes sense? Actually, I would consider lineSep as a string in JSON format like:

.option("lineSep", "[0x00, 0x0A, 0x00, 0x0D]")

or

.option("lineSep", """{"sep": "\r\n", "encoding": "UTF-16LE"}""")

but for regular cases of a simple lineSep, this approach looks pretty redundant:

.option("lineSep", ",") vs .option("lineSep", """{"sep": ","}""")

Member

I don't know why they are stuck. I was just thinking of doing the sequence of bytes separately since I feel the same way about the need, and there's a PR that potentially covers this case.
I am less sure about the other formats yet - I don't feel strongly, but I don't have better alternatives yet.

Member Author

Just for clarification, you will accept only the solution with an array for lineSep, right?

Member

Yes, that's what I thought, and I hope the sequence-of-bytes support alone is taken out here if you don't mind. In that case, I don't want to block this PR either.

Member

BTW, I misunderstood this PR because you had also proposed regex support before.

Member Author

I reserved the r prefix for that. It is hard to support it right now because it requires some support from Hadoop's LineReader.

@SparkQA

SparkQA commented Apr 30, 2018

Test build #89976 has finished for PR 21192 at commit 4b3592f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented May 2, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented May 2, 2018

Test build #90050 has finished for PR 21192 at commit bcc17e0.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented May 17, 2018

eh I actually think separated makes it much simpler to look at, compared with an array. Why complicate the API and require users to understand how to specify an array (in all languages)?

One question I have is whether we'd want to support multiple separators, where each separator can be a multi-char sequence as well. In that case an array might make more sense for specifying the multiple separators, with each separator just space-delimited for chars.

@HyukjinKwon
Member

eh I actually think separated makes it much simpler to look at, compared with an array. Why complicate the API and require users to understand how to specify an array (in all languages)?

Hey @rxin, it's because an array can commonly be given to an option. There are many such cases in CSV at least. What's the actual diff between the proposed change here and an array at the core? An array option API at least provides a safer and cleaner way for users.

@rxin
Contributor

rxin commented May 17, 2018

my point is that I don't consider a sequence of chars an array to begin with. It is not natural to me.

I'd want an array if it is a different set of separators.

@HyukjinKwon
Member

Ah, I misunderstood. I suggested taking out its sequence support here for now anyway.

@cloud-fan
Contributor

It seems good to use the array option for the hexadecimal sequence, if we have the array option feature. But I don't think we should block this PR just because the array option is not ready.

As long as we can finish the array option before Spark 2.4, we can still change the lineSeparator option and use the array option.

@HyukjinKwon
Member

sorry, why don't we do the interface thing first?

@cloud-fan
Contributor

I don't see the necessity of having an order here. We can do the array option interface or this one first.

Using a format like x0a 0d is acceptable here, though it seems better to use the array option.

From an engineering point of view, we should try our best to avoid dependencies between different pieces of work.

If you think x0a 0d is a bad format, then it's fair to block this PR, and we can discuss what's wrong with it.

@HyukjinKwon
Member

Everything here would need to be rewritten if we go the array way, if I'm not mistaken. In that case, I think we had better, in a way, do the array option first, if we are all willing to add it within 2.4.0, to reduce the work and effort. FWIW, I don't feel strongly about being too clever by introducing a custom string format when we have other alternatives.

@cloud-fan
Contributor

Regarding the array option, what's the status? We can provide an API on the Spark side to easily set a Seq/Array in the options, but what about the read side? Do users need to understand the array option string format and parse it themselves?

@HyukjinKwon
Member

HyukjinKwon commented May 21, 2018

It's stuck. I kept pinging and gave up.

Do users need to understand the array option string format and parse it themselves?

Yea, I think this is the reason why it's stuck. I believe there's no easier option, so at that time I was trying to document this. It's at least not a custom format but standard JSON.

@cloud-fan
Contributor

I'd like to hide the format from users (including data source developers). In DataSourceOptions I also use the JSON array format to store paths, and DataSourceOptions#getPaths to retrieve them. Then users don't need to create a JSON parser to read options.

Maybe we can go in that direction?
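
The accessor idea above could be sketched roughly as follows (a toy helper with assumed names, not Spark's actual DataSourceOptions; a bracketed comma-separated encoding stands in for the JSON array format):

```scala
// Toy sketch: hide a list-valued option's wire format behind set/get helpers,
// so callers never parse the encoded string themselves (names are assumed).
class SketchOptions {
  private var underlying = Map.empty[String, String]

  // Encode a sequence of byte values into a single option string.
  def setIntSeq(key: String, values: Seq[Int]): Unit =
    underlying += key -> values.mkString("[", ",", "]")

  // Decode it back; the caller never sees the underlying format.
  def getIntSeq(key: String): Option[Seq[Int]] =
    underlying.get(key).map { s =>
      s.stripPrefix("[").stripSuffix("]").split(",").map(_.trim.toInt).toSeq
    }
}
```

With helpers like these, neither end users nor datasource developers need to know how the sequence is serialized, which is the point of the getPaths-style approach.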

@HyukjinKwon
Member

I thought about it at that time, but I forget why I ended up giving up (probably because the change would become big and I thought it wasn't going to be accepted). Will take another fresh look soon.

@SparkQA

SparkQA commented Jun 15, 2018

Test build #91906 has finished for PR 21192 at commit 9e2157c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 22, 2018

Test build #92184 has finished for PR 21192 at commit eab96b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ForeachBatchFunction(object):
  • case class ArrayDistinct(child: Expression)
  • class PythonForeachWriter(func: PythonFunction, schema: StructType)
  • class UnsafeRowBuffer(taskMemoryManager: TaskMemoryManager, tempDir: File, numFields: Int)
  • trait MemorySinkBase extends BaseStreamingSink with Logging
  • class MemorySink(val schema: StructType, outputMode: OutputMode, options: DataSourceOptions)
  • class ForeachBatchSink[T](batchWriter: (Dataset[T], Long) => Unit, encoder: ExpressionEncoder[T])
  • trait PythonForeachBatchFunction
  • case class ForeachWriterProvider[T](
  • case class ForeachWriterFactory[T](
  • class ForeachDataWriter[T](
  • class MemoryWriter(
  • class MemoryStreamWriter(

…sep2

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
@SparkQA

SparkQA commented Jul 7, 2018

Test build #92707 has finished for PR 21192 at commit 5384c07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Jul 17, 2018

Looks like there is no consensus on this PR. @rxin @cloud-fan @HyukjinKwon Should I close it?

@HyukjinKwon
Member

To me, yes. If you find some time, I would appreciate it if you took a look at #21192 (comment) too, so that we can review each other's work, whoever makes a PR first.
