[SPARK-3205] add EscapedTextInputFormat #2118
Conversation
QA tests have started for PR 2118 at commit
This does not need to be in Spark core. Btw, since we allow any arbitrary InputFormat to be used in Spark, users can use any existing Hadoop InputFormat/OutputFormat for this purpose.
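For reference, a minimal sketch (not from this PR) of plugging an arbitrary Hadoop InputFormat into Spark via `SparkContext.newAPIHadoopFile`; the stock `TextInputFormat` and the input path are stand-ins for whatever third-party escaped-text format one would actually use:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object ArbitraryInputFormatExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("arbitrary-input-format").setMaster("local[*]"))
    // Any new-API Hadoop InputFormat can be substituted for TextInputFormat below.
    val records = sc
      .newAPIHadoopFile[LongWritable, Text, TextInputFormat]("/path/to/data")
      // Convert Text to String right away; Text is mutable, reused by the
      // record reader, and not java.io.Serializable.
      .map { case (_, line) => line.toString }
    records.take(5).foreach(println)
    sc.stop()
  }
}
```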
QA tests have finished for PR 2118 at commit
@mridulm Any reference to an existing input format? I searched on Google; the closest I found is https://github.com/msukmanowsky/OmnitureDataFileInputFormat, but it is different.
QA tests have started for PR 2118 at commit
@mengxr Other than custom input/output formats I have written, IIRC Pig and Jaql support this. Both are open source and run on top of Hadoop, so they have input/output formats for this, though I'm not sure it is possible to directly import their code (it might bring in too many other dependencies, and might sit within deep layers of their abstractions). There are also CSV-based readers/writers out there which allow us to customize the escape and delimiter characters; it might be possible to adapt them, I suppose, though I have not investigated in detail. Even assuming we can't borrow this from an external source verbatim and have to author it ourselves, I am not in favor of putting it in core.
Force-pushed f0e3842 to ac0ace8
Force-pushed ac0ace8 to e35a366
QA tests have started for PR 2118 at commit
QA tests have finished for PR 2118 at commit
QA tests have finished for PR 2118 at commit
QA tests have started for PR 2118 at commit
QA tests have finished for PR 2118 at commit
It might be better to put this in the "io" package in case we also create output formats later, but no strong feelings. I guess the Hadoop 2 one is called "input"; it just seems weird to make a new package for this alone.
BTW, if you do add a new package, you'll have to fix the SBT code that generates Javadocs and Scaladocs to make sure it appears in the right ones.
@mridulm I moved the implementation to https://github.com/mengxr/redshift-input-format, and I'm closing this PR for now. If people feel that this input format is very useful, we can put it back into Spark core later. Thanks @mridulm and @mateiz for the code review!
Since we might keep needing to add input formats, how about creating a separate Maven artifact for them?
I like the idea of a separate Maven artifact for this. IMO we should try to have common formats easily accessible in Spark, but if core depends on spark-hadoop-io, that will solve that problem.
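If such an artifact existed, pulling it in would be a one-line dependency. A sketch in sbt, with purely hypothetical coordinates (the spark-hadoop-io artifact is only a proposal in this thread and was never published):

```scala
// Hypothetical coordinates for the proposed artifact; adjust to whatever is released.
libraryDependencies += "org.apache.spark" %% "spark-hadoop-io" % "1.2.0"
```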
Text records may contain in-record delimiters or newline characters. In such cases, we can either encode them or escape them. The latter is simpler and is what Redshift's UNLOAD with the ESCAPE option uses. The problem is that an escaped record will span multiple lines, so we need an input format that finds the true record boundaries.
@marmbrus
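To make the record-boundary problem concrete, here is a minimal sketch (not the PR's actual implementation) of splitting Redshift-style escaped text into records: the escape character protects the next character, so only an unescaped newline ends a record. The real input format is harder still, because an HDFS split boundary can fall inside a record, or even between an escape character and the character it escapes.

```scala
import scala.collection.mutable.ArrayBuffer

object EscapedRecordSplitter {
  /** Split text into records at unescaped newlines, keeping escape pairs verbatim. */
  def splitRecords(input: String, escape: Char = '\\'): Seq[String] = {
    val records = ArrayBuffer.empty[String]
    val current = new StringBuilder
    var i = 0
    while (i < input.length) {
      val c = input.charAt(i)
      if (c == escape && i + 1 < input.length) {
        // Escape pair: copy both characters, including an escaped newline.
        current.append(c).append(input.charAt(i + 1))
        i += 2
      } else if (c == '\n') {
        // Unescaped newline terminates the record.
        records += current.toString
        current.clear()
        i += 1
      } else {
        current.append(c)
        i += 1
      }
    }
    if (current.nonEmpty) records += current.toString
    records.toSeq
  }

  def main(args: Array[String]): Unit = {
    // "\\\n" below is an escaped newline inside the first record.
    val data = "a|b\\\nc|d\ne|f\n"
    splitRecords(data).foreach(r => println("record: [" + r.replace("\n", "\\n") + "]"))
    // record: [a|b\\nc|d]   -- one record spanning two physical lines in the file
    // record: [e|f]
  }
}
```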