-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-24118][SQL] Flexible format for the lineSep option of Text and JSON datasources #21192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #89973 has finished for PR 21192 at commit
|
|
Test build #89972 has finished for PR 21192 at commit
|
|
jenkins, retest this, please |
| * selector (1 char) + separator spec (any length) | sequence of chars | ||
| * | ||
| * Currently the following selectors are supported: | ||
| * - 'x' + sequence of bytes in hexadecimal format. For example: "x0a 0d". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think this one (sequence of bytes) should be supported via option APIs, not only for lineSep specifically. For example, I proposed an array type support in option - #16611. Maybe, we could try a similar approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your approach looks interesting but your PRs #20125 and #16611 stuck. Are there any specific reasons for that?
Maybe, we could try a similar approach.
We could but it requires extension of public API more widely than it is needed by this changes. Do you think it makes sense? Actually I would consider lineSep as a string in JSON format like:
.option("lineSep", "[0x00, 0x0A, 0x00, 0x0D]")or
.option("lineSep", """{"sep": "\r\n", "encoding": "UTF-16LE"}""")but for regular cases of simple lineSep, this approach looks pretty redundant:
.option("lineSep", ",") vs .option("lineSep", """{"sep": ","}""")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why they are stuck. I was just thinking of doing sequence of bytes separately since I feel in the same way about the need and there's a PR that potentially covers this case.
I am less sure about other formats yet - don't feel strongly but I don't have better alternatives yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for clarification, you will accept only the solution with an array for lineSep, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's what I thought and I hope the sequence of bytes alone is taken out here if you won't mind. In this case, I don't want to block this PR too likewise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I misunderstood this PR because you proposed a regex support too before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reserved the r prefix for that. It is hard to support it right now because it requires some support from Hadoop's LineReader.
|
Test build #89976 has finished for PR 21192 at commit
|
|
jenkins, retest this, please |
|
Test build #90050 has finished for PR 21192 at commit
|
|
eh I actually think separated makes it much simpler to look at, compared with an array. Why complicate the API and require users to understand how to specify an array (in all languages)? One question I have is whether we'd want to support multiple separators and each separator can be multi sequence chars as well. In that case an array might make more sense to specify the multi separators, and each separator is just space delimited for chars. |
Hey @rxin, it's because array can be given to an option commonly. There are many cases in CSV at least. What's actual diff with the proposed change here and array at the core? Array option api at least provide a complication safe and cleaner way for users. |
|
my point is that i don't consider a sequence of chars an array to begin with. it is not natural to me. I'd want an array if it is a different set of separators. |
|
Ah, I misunderstood. I suggested to take out its sequence support here for now anyway. |
|
It seems good to use the array option for the hexadecimal sequence, if we have the array option feature. But I don't think we should block this PR just because the array option is not ready. As long as we can finish the array option before Spark 2.4, we can still change the |
|
sorry, why don't we do the interface thing first? |
|
I don't see the necessity of having an order here. We can do the array option interface, or this one first. Using format like From engineering point of view, we should try out best to avoid dependencies between different works. If you think |
|
Everything here should be rewritten if we go for the array way if I'm not mistaken. In this case, I think we should better, in a way, do the array option if we are all willing to add it within 2.4.0 to reduce the work and effort. FWIW, I don't feel strongly about being too clever for introducung a custom format of a string when we have other alternatives. |
|
regarding the array option, what's the status? We can provide API at Spark side to easily set Seq/Array to the options but how about the read side? Do users need to understand the array option string format and parse it themselves? |
|
It's stuck. I kept pinging and gave up.
Yea, I think this is the reason why it's stuck. I believe there's no easier other option so at that time I was trying to document this. It's at least not a custom format but a standard JSON. |
|
I'd like to hide the format from users(including data source developers). In Maybe we can go with that direction? |
|
I thought about it at that time but I forget why I ended up with giving up (probably because the change will become big and I thought it's not going to be accepted). Will take another refreshing look soon. |
|
Test build #91906 has finished for PR 21192 at commit
|
|
Test build #92184 has finished for PR 21192 at commit
|
…sep2 # Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
|
Test build #92707 has finished for PR 21192 at commit
|
|
Looks like there is no consensus for the PR. @rxin @cloud-fan @HyukjinKwon Should I close it? |
|
To me, yes. If you find some times, I would appreciate if you take a look at #21192 (comment) too so that we can review each other whoever make a PR first. |
What changes were proposed in this pull request?
I propose flexible format for the lineSep option used in text datasources like Json. New format of the option has the following syntax:
Examples of lineSep in the new format:
The
'\'and'r'are reserved for future usage. For instance,'r'could be used for regular expressions liner[0-9]+orr(x1E|x0Ax1E|x0A)for parsing Json StreamingNew format addresses the use cases:
Hexadecimal format allows to specify
lineSepindependently from encoding. It gives opportunity for reading json files with BOM in per-line mode. See [SPARK-23723] New charset option for json datasource #20849 (comment)Jsons coming usually from embedded systems have not-standard separators (invisible in some cases). It is very convenient to open a file in hex editor and copy bytes between }{ to the lineSep option. This is the use case for the format with
'x'selector like:x0d 54 45In Json Streaming, records could be separated in pretty different ways. We should leave room for improvement I believe. See
'r'(for regexp) and'/'reserved selectorsSome UTF-8 chars could cause errors from style (format) checkers. It is easier to represent such chars in hexadecimal format instead of disabling the checkers.
In near future, json datasource will support input json in different charsets. If the source code in UTF-8 but input json in different charset, it is slightly hard to put such chars as values for the lineSep option. The
x<hexs>format is more convenient here again.How was this patch tested?
The changes are checked by 2 new tests in which JSON files in
UTF-16andUTF-32with BOM are read. Also 2 new cases for an existing test are added.