Add support for values with self-closing tags #285

BioQwer · 2018-03-08T10:22:00Z

Add test case
Bump version

codecov-io · 2018-03-09T12:17:49Z

Codecov Report

Merging #285 into master will decrease coverage by 0.12%.
The diff coverage is 95.65%.

@@            Coverage Diff            @@
##           master    #285      +/-   ##
=========================================
- Coverage   88.52%   88.4%   -0.13%     
=========================================
  Files          14      14              
  Lines         732     733       +1     
  Branches      101      98       -3     
=========================================
  Hits          648     648              
- Misses         84      85       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...la/com/databricks/spark/xml/util/InferSchema.scala | `87.14% <100%> (ø)` | ⬆️ |
| .../scala/com/databricks/spark/xml/util/XmlFile.scala | `100% <100%> (ø)` | ⬆️ |
| ...abricks/spark/xml/parsers/StaxXmlParserUtils.scala | `97.87% <100%> (+0.04%)` | ⬆️ |
| ...cala/com/databricks/spark/xml/XmlInputFormat.scala | `94.35% <100%> (+0.18%)` | ⬆️ |
| ...m/databricks/spark/xml/parsers/StaxXmlParser.scala | `97.08% <100%> (ø)` | ⬆️ |
| ...in/scala/com/databricks/spark/xml/XmlOptions.scala | `94.28% <50%> (-2.78%)` | ⬇️ |

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3e5d972...1ff0468. Read the comment docs.

BioQwer · 2018-03-09T12:40:47Z

Hello @HyukjinKwon
Can you merge these changes?

HyukjinKwon · 2018-03-10T13:11:15Z

@BioQwer, thanks for this fix. Seems roughly fine. Will try to take a closer look soon.

BioQwer · 2018-03-11T00:00:05Z

@HyukjinKwon really looking forward to it, because it covers a lot of problems.
Other people have tried to fix this before, but their fixes had mistakes and lacked working unit tests.

ashokkumargopu · 2018-03-13T05:21:00Z

Hi,

The last release of Databricks spark-xml, version 0.4.1, was on Nov 6, 2016. Can someone share details about the next release of spark-xml?

Thanks in Advance

BioQwer · 2018-03-19T15:36:00Z

@HyukjinKwon hello, what about merging my changes?

RJKeevil · 2018-03-22T12:51:38Z

src/main/scala/com/databricks/spark/xml/XmlOptions.scala


  require(rowTag.nonEmpty, "'rowTag' option should not be empty string.")
-  require(attributePrefix.nonEmpty, "'attributePrefix' option should not be empty string.")
+  logger.warn("'attributePrefix' option should not be empty string.")


I think this still needs to be conditional, else it will always fire the warning?

@RJKeevil one XML file = one warning.
For my XML I need an empty attributePrefix, because all my values are in attributes.

My task is to use the FIAS (all addresses of Russia) XML in Spark.
I think many Russians will run into this problem.

I understand, but the way the code is written it will fire the empty attribute prefix warning, whether it is empty or not? I have a prefix of _ defined and I get that warning in the logs

Oh, I understand. I will fix it.
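
The concern raised in this thread can be illustrated with a minimal sketch. It is not the actual `XmlOptions` code: the `validate` helper is hypothetical, and it returns the warnings as a list instead of calling the project's logger, so the behavior is easy to check.

```scala
object AttributePrefixCheck {
  // Hypothetical stand-in for option validation; not the actual XmlOptions code.
  // Returns the warnings that would be logged for the given options.
  def validate(rowTag: String, attributePrefix: String): List[String] = {
    // rowTag stays a hard requirement, as in the diff above.
    require(rowTag.nonEmpty, "'rowTag' option should not be empty string.")
    // Warn only when the prefix really is empty, instead of unconditionally.
    if (attributePrefix.isEmpty) {
      List("'attributePrefix' option should not be empty string.")
    } else {
      Nil
    }
  }
}
```

With a prefix of `_`, `validate` yields no warnings; with an empty prefix it yields exactly one, which matches the "warn only if empty" behavior discussed here.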

jaybooth4 · 2018-03-25T18:15:59Z

When is the new version of spark-xml with this support going to be released?

BioQwer · 2018-03-26T09:09:01Z

@jaybooth4 Hello
Temporarily, you can use this build.

Hadoop added a woodstox dependency in https://issues.apache.org/jira/browse/HADOOP-14501. When using versions of Hadoop with that dependency, WstxOutputFactory will get loaded for XMLOutputFactory.newInstance. The resulting streamwriter will ensure that a client does not write two root elements, which causes the current code to error out. This PR makes it such that the writer function will start the root tag before writing rows. Author: Patrick Woody <[email protected]> Closes #282 from pwoody/pw/multipleRootFix.
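
The commit message above describes starting the root tag before any row is written, so that the stream writer only ever sees one root element. A minimal sketch of that ordering with the standard `javax.xml.stream` API follows. The `writeRows` helper and the tag names are hypothetical, and this is not spark-xml's writer code.

```scala
import java.io.StringWriter
import javax.xml.stream.XMLOutputFactory

object SingleRootSketch {
  // Writes all rows under one root element so the document has a single root.
  def writeRows(rootTag: String, rowTag: String, rows: Seq[String]): String = {
    val out = new StringWriter()
    val writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out)
    // Start the root element once, before any row is written.
    writer.writeStartElement(rootTag)
    rows.foreach { value =>
      writer.writeStartElement(rowTag)
      writer.writeCharacters(value)
      writer.writeEndElement()
    }
    // Close the root element after the last row.
    writer.writeEndElement()
    writer.flush()
    writer.close()
    out.toString
  }
}
```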

HyukjinKwon · 2018-04-02T07:10:14Z

src/main/scala/com/databricks/spark/xml/XmlOptions.scala

  val permissive = ParseModes.isPermissiveMode(parseMode)

  require(rowTag.nonEmpty, "'rowTag' option should not be empty string.")
-  require(attributePrefix.nonEmpty, "'attributePrefix' option should not be empty string.")


@BioQwer, BTW, mind if I ask why we warn here instead of throwing the exception?

@HyukjinKwon For my XML I need an empty attributePrefix, because all my values are in attributes.
My task is to use the FIAS (all addresses of Russia) XML in Spark.
I think many Russians will run into this problem.

HyukjinKwon · 2018-04-02T07:10:31Z

src/main/scala/com/databricks/spark/xml/parsers/StaxXmlParser.scala

      options: XmlOptions,
      rootAttributes: Array[Attribute] = Array.empty): Row = {
    val row = new Array[Any](schema.length)
+    val nameToIndex = schema.map(_.name).zipWithIndex.toMap


Would you mind elaborating on this change?

@HyukjinKwon if we have attributes, we need to read them first, before reading values.
Previously, if an element had attributes but no value, the attributes were not read. I fixed that.
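
To illustrate the `nameToIndex` line in the hunk above, here is a minimal sketch of a name-to-position lookup used to place attribute values into a row. It makes several assumptions: a plain `Seq[String]` stands in for `schema.map(_.name)`, attributes are a simple `Map`, and `fillRow` is a hypothetical helper rather than the parser's real method.

```scala
object NameToIndexSketch {
  // Field names standing in for schema.map(_.name); an assumption for illustration.
  def buildIndex(fieldNames: Seq[String]): Map[String, Int] =
    fieldNames.zipWithIndex.toMap

  // Hypothetical helper: place attribute values into row slots by field name,
  // ignoring attributes that the schema does not know about.
  def fillRow(fieldNames: Seq[String], attributes: Map[String, String]): Array[Any] = {
    val nameToIndex = buildIndex(fieldNames)
    val row = new Array[Any](fieldNames.length)
    attributes.foreach { case (name, value) =>
      nameToIndex.get(name).foreach(i => row(i) = value)
    }
    row
  }
}
```

Building the map once per row means each attribute is placed with a single lookup, and the attributes can be filled in before any element value is read.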

HyukjinKwon · 2018-04-02T07:10:53Z

src/main/scala/com/databricks/spark/xml/util/InferSchema.scala

          Some(inferObject(parser, options, rootAttributes))
        } catch {
-          case NonFatal(_) if shouldHandleCorruptRecord =>
+          case NonFatal(x) if shouldHandleCorruptRecord =>


I think _ is fine since that's not used here.

I will check it

HyukjinKwon · 2018-04-02T07:12:06Z

src/test/scala/com/databricks/spark/xml/XmlSuite.scala

 import org.apache.hadoop.io.{LongWritable, Text}
 import org.apache.hadoop.io.compress.GzipCodec
 import org.scalatest.{BeforeAndAfterAll, FunSuite}
-


This empty line is actually part of the style guide - https://github.com/databricks/scala-style-guide#imports :-)

I will fix it

It would be better if IDEA gave a hint about it :)

HyukjinKwon · 2018-04-02T07:13:02Z

src/main/scala/com/databricks/spark/xml/XmlInputFormat.scala

    var ei = 0
    var depth = 0
+
+    def checkEmptyTag(currentLetter: Int, position: Int): Boolean = {


Will take another look at this logic.

The logic usually tries to find the end tag </rootTag>; I added a check for the self-closing end /> as well.
Each time we check both of these situations.
If you have any suggestions for this logic, I think it's better to handle them as a refactoring in the next version.

Yeah, got it, but I simply meant that I would double-check :).
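
The idea described in this thread, checking for both an explicit end tag and a self-closing end while scanning, can be sketched on a plain string. This is a simplified illustration and not the PR's `checkEmptyTag`: the `findRecordEnd` helper is hypothetical, and it assumes the input begins with the row's start tag, that no `>` appears inside attribute values, and that rows of the same name are not nested.

```scala
object SelfClosingSketch {
  // Returns the index just past the end of the first record, or -1 if not found.
  def findRecordEnd(xml: String, rowTag: String): Int = {
    val endTag = s"</$rowTag>"
    // Locate the end of the start tag first.
    val startTagClose = xml.indexOf('>')
    if (startTagClose < 0) {
      -1
    } else if (startTagClose > 0 && xml.charAt(startTagClose - 1) == '/') {
      // The start tag ends with "/>", so the record is self-closing.
      startTagClose + 1
    } else {
      // Otherwise look for the explicit end tag.
      val end = xml.indexOf(endTag, startTagClose)
      if (end < 0) -1 else end + endTag.length
    }
  }
}
```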

BioQwer · 2018-04-04T12:50:51Z

I fixed it.

belenaj · 2018-04-04T13:20:31Z

Hi BioQwer,
with your release v0.4.2_empty_prefix, I still can't make it work for my XMLs (please check PR #291)

Any ideas so far?
Thanks!

BioQwer · 2018-04-04T17:19:22Z

@jbelenag Hello!
I answered your issue.
@HyukjinKwon this PR doesn't fix #291

HyukjinKwon · 2018-04-05T13:30:25Z

You are right. I rushed when I took a look.

BioQwer · 2018-04-05T15:54:08Z

@HyukjinKwon what about merging this PR? :)

BioQwer · 2018-04-24T10:00:01Z

@HyukjinKwon what happened?
I made the fixes requested in the review.
Many people need this fix.
Now my changes have conflicts with master :(

HyukjinKwon · 2018-05-13T11:04:26Z

Guys, sorry for the late response. I manually resolved the conflicts and opened #303 with his commits. It's a pretty core fix, so I had to be very careful before merging it in.

BioQwer and others added 2 commits March 8, 2018 13:18

make fix for rows with empty end tag

fb488fb

add test case pump version Signed-off-by: Anton Alexandrov <[email protected]>

make correct read but still schema creation with error

3361d4b

Signed-off-by: Anton Alexandrov <[email protected]>

BioQwer changed the title ~~make fix for rows with empty end tag~~ Add support for values with self-closing tags Mar 9, 2018

BioQwer added 2 commits March 9, 2018 10:43

all works is fine

a137182

Signed-off-by: Antony Alexandrov <[email protected]>

prepare for merge

b2f6aeb

Signed-off-by: Antony Alexandrov <[email protected]>

BioQwer mentioned this pull request Mar 9, 2018

Double Closing Tag are created for self closing element when using custom schema as StringType #241

Closed

BioQwer added 2 commits March 16, 2018 12:22

attributePrefix not require to be nonEmpty

880478d

Signed-off-by: Antony Alexandrov <[email protected]>

delete test for empty attribute prefix

5f0d3e4

Signed-off-by: Antony Alexandrov <[email protected]>

RJKeevil reviewed Mar 22, 2018

fix attribute prefix warn only if empty

a627347

Signed-off-by: Antony Alexandrov <[email protected]>

HyukjinKwon reviewed Apr 2, 2018

HyukjinKwon mentioned this pull request Apr 2, 2018

Extracting text and schema from single tag #291

Closed

BioQwer and others added 4 commits April 4, 2018 12:51

make fix for rows with empty end tag

207f24c

add test case pump version Signed-off-by: Anton Alexandrov <[email protected]>

make correct read but still schema creation with error

639f86a

Signed-off-by: Anton Alexandrov <[email protected]>

all works is fine

a1a0103

Signed-off-by: Antony Alexandrov <[email protected]>

prepare for merge

2172e96

Signed-off-by: Antony Alexandrov <[email protected]>

BioQwer added 6 commits April 4, 2018 12:51

attributePrefix not require to be nonEmpty

8e8ba59

Signed-off-by: Antony Alexandrov <[email protected]>

delete test for empty attribute prefix

25cdaf8

Signed-off-by: Antony Alexandrov <[email protected]>

fix attribute prefix warn only if empty

b0b3972

Signed-off-by: Antony Alexandrov <[email protected]>

little fix after review

60d8bc6

Merge remote-tracking branch 'origin/master'

e05d1f0

# Conflicts: # src/test/scala/com/databricks/spark/xml/XmlSuite.scala

little fix after review

1ff0468

HyukjinKwon mentioned this pull request May 13, 2018

Add support for values with self-closing tags #303

Merged

HyukjinKwon closed this in #303 May 16, 2018
