This repository was archived by the owner on Mar 24, 2025. It is now read-only.

@BioQwer
Contributor

@BioQwer BioQwer commented Mar 8, 2018

Add a test case.
Bump the version.

BioQwer and others added 2 commits March 8, 2018 13:18
add test case
pump version

Signed-off-by: Anton Alexandrov <[email protected]>
@BioQwer changed the title from "make fix for rows with empty end tag" to "Add support for values with self-closing tags" on Mar 9, 2018
BioQwer added 2 commits March 9, 2018 10:43
Signed-off-by: Antony Alexandrov <[email protected]>
Signed-off-by: Antony Alexandrov <[email protected]>
@codecov-io

codecov-io commented Mar 9, 2018

Codecov Report

Merging #285 into master will decrease coverage by 0.12%.
The diff coverage is 95.65%.

@@            Coverage Diff            @@
##           master    #285      +/-   ##
=========================================
- Coverage   88.52%   88.4%   -0.13%     
=========================================
  Files          14      14              
  Lines         732     733       +1     
  Branches      101      98       -3     
=========================================
  Hits          648     648              
- Misses         84      85       +1
Impacted Files Coverage Δ
...la/com/databricks/spark/xml/util/InferSchema.scala 87.14% <100%> (ø) ⬆️
.../scala/com/databricks/spark/xml/util/XmlFile.scala 100% <100%> (ø) ⬆️
...abricks/spark/xml/parsers/StaxXmlParserUtils.scala 97.87% <100%> (+0.04%) ⬆️
...cala/com/databricks/spark/xml/XmlInputFormat.scala 94.35% <100%> (+0.18%) ⬆️
...m/databricks/spark/xml/parsers/StaxXmlParser.scala 97.08% <100%> (ø) ⬆️
...in/scala/com/databricks/spark/xml/XmlOptions.scala 94.28% <50%> (-2.78%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3e5d972...1ff0468.

@BioQwer
Contributor Author

BioQwer commented Mar 9, 2018

Hello @HyukjinKwon,
Can you merge these changes?

@HyukjinKwon
Member

@BioQwer, thanks for this fix. Seems roughly fine. Will try to take a closer look soon.

@BioQwer
Contributor Author

BioQwer commented Mar 11, 2018

@HyukjinKwon I'm really looking forward to it, because this fixes a lot of problems.
Other people have tried to fix this before, but their fixes had mistakes and no working unit tests.

@ashokkumargopu

Hi,

The last release of databricks spark-xml, version 0.4.1, was on Nov 6, 2016. Can someone share details about the next release of databricks spark-xml?

Thanks in advance

@BioQwer
Contributor Author

BioQwer commented Mar 19, 2018

@HyukjinKwon hello, what about merging my changes?


  require(rowTag.nonEmpty, "'rowTag' option should not be empty string.")
- require(attributePrefix.nonEmpty, "'attributePrefix' option should not be empty string.")
+ logger.warn("'attributePrefix' option should not be empty string.")

I think this still needs to be conditional, else it will always fire the warning?

Contributor Author

@RJKeevil It is one warning per XML file.
For my XML I need an empty attributePrefix, because all my values are in attributes.

Contributor Author

My task is to use the FIAS (all addresses of Russia) XML in Spark.
I think many Russian users will run into this problem.

I understand, but the way the code is written, it fires the empty attribute prefix warning whether the prefix is empty or not. I have a prefix of _ defined and I still get that warning in the logs.

Contributor Author

Oh, I understand. I will fix it.
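
The point raised in this thread is that the warning should fire only when the prefix is actually empty. A minimal sketch of that condition, assuming a hypothetical helper `attributePrefixWarning` (not part of spark-xml) that returns the message so the caller can hand it to whatever logger is in use:

```scala
object AttributePrefixCheck {
  // Hypothetical helper, not part of spark-xml: yields the warning text only
  // when the configured prefix is empty, so a prefix such as "_" stays silent.
  def attributePrefixWarning(attributePrefix: String): Option[String] =
    if (attributePrefix.isEmpty) {
      Some("'attributePrefix' option should not be empty string.")
    } else {
      None
    }

  def main(args: Array[String]): Unit = {
    // Only the empty prefix produces a message; "_" prints nothing.
    Seq("", "_").foreach { prefix =>
      attributePrefixWarning(prefix).foreach(msg => println(s"WARN $msg"))
    }
  }
}
```

Returning an `Option` keeps the condition testable without depending on a particular logging library.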

@jaybooth4

When is the new version of spark-xml with this support going to be released?

@BioQwer
Contributor Author

BioQwer commented Mar 26, 2018

@jaybooth4 Hello,
temporarily you can use this build

Hadoop added a woodstox dependency in https://issues.apache.org/jira/browse/HADOOP-14501. When using versions of Hadoop with that dependency, WstxOutputFactory will get loaded for XMLOutputFactory.newInstance. The resulting streamwriter will ensure that a client does not write two root elements, which causes the current code to error out.

This PR makes it such that the writer function will start the root tag before writing rows.

Author: Patrick Woody <[email protected]>

Closes #282 from pwoody/pw/multipleRootFix.
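
The commit message above describes opening the root element once before any rows are written. A rough sketch of that ordering using the standard `javax.xml.stream` API; the object name `SingleRootWriter` and its parameters are illustrative assumptions, not the actual spark-xml writer code:

```scala
import java.io.StringWriter
import javax.xml.stream.XMLOutputFactory

object SingleRootWriter {
  // Illustrative only: rootTag, rowTag and rows are hypothetical parameters.
  def write(rootTag: String, rowTag: String, rows: Seq[String]): String = {
    val out = new StringWriter()
    val writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out)
    // Start the single root element first, so every row becomes a child of it
    // and a strict stream writer never sees a second root element.
    writer.writeStartElement(rootTag)
    rows.foreach { value =>
      writer.writeStartElement(rowTag)
      writer.writeCharacters(value)
      writer.writeEndElement()
    }
    writer.writeEndElement()
    writer.flush()
    writer.close()
    out.toString
  }
}
```

Calling `SingleRootWriter.write("ROWS", "ROW", Seq("a", "b"))` produces one root element wrapping both rows.
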
val permissive = ParseModes.isPermissiveMode(parseMode)

require(rowTag.nonEmpty, "'rowTag' option should not be empty string.")
require(attributePrefix.nonEmpty, "'attributePrefix' option should not be empty string.")

Member

@BioQwer, BTW, mind if I ask why we warn here instead of throwing the exception?

Contributor Author

@HyukjinKwon For my XML I need an empty attributePrefix, because all my values are in attributes.
My task is to use the FIAS (all addresses of Russia) XML in Spark.
I think many Russian users will run into this problem.

options: XmlOptions,
rootAttributes: Array[Attribute] = Array.empty): Row = {
val row = new Array[Any](schema.length)
val nameToIndex = schema.map(_.name).zipWithIndex.toMap

Member

Would you mind elaborating on this change?

Contributor Author

@HyukjinKwon If we have attributes, we need to read them first, before reading values.
Previously, if an element had attributes but no values, the attributes were not read. I fixed that.
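
The idea described here, filling in attributes before child values so that an element carrying only attributes still yields data, can be sketched as follows. This is a simplified stand-in, not the parser code from the PR: `AttributesFirst`, `toRow`, and the use of plain maps in place of parsed XML events are all assumptions made for illustration.

```scala
object AttributesFirst {
  // Hypothetical, simplified stand-in for the row conversion: field names play
  // the role of the schema, and plain maps stand in for parsed XML events.
  def toRow(
      fieldNames: Seq[String],
      attributePrefix: String,
      attributes: Map[String, String],
      childValues: Map[String, String]): Array[Any] = {
    val row = new Array[Any](fieldNames.length)
    val nameToIndex = fieldNames.zipWithIndex.toMap
    // Attributes are written first, so they are kept even when the element
    // has no child values at all (for example, a self-closing tag).
    attributes.foreach { case (name, value) =>
      nameToIndex.get(attributePrefix + name).foreach(i => row(i) = value)
    }
    // Child values are filled in afterwards; unknown names are ignored.
    childValues.foreach { case (name, value) =>
      nameToIndex.get(name).foreach(i => row(i) = value)
    }
    row
  }
}
```

With an empty prefix, `AttributesFirst.toRow(Seq("id", "name"), "", Map("id" -> "1", "name" -> "x"), Map.empty)` fills both fields from attributes alone.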

Some(inferObject(parser, options, rootAttributes))
} catch {
- case NonFatal(_) if shouldHandleCorruptRecord =>
+ case NonFatal(x) if shouldHandleCorruptRecord =>

Member

I think _ is fine since that's not used here.

Contributor Author

I will check it

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.io.compress.GzipCodec
import org.scalatest.{BeforeAndAfterAll, FunSuite}

Member

This empty line is actually part of the style guide: https://github.com/databricks/scala-style-guide#imports :-)

Contributor Author

@BioQwer BioQwer Apr 2, 2018

I will fix it

Contributor Author

It would be better if IDEA gave a hint about it :)

var ei = 0
var depth = 0

def checkEmptyTag(currentLetter: Int, position: Int): Boolean = {

Member

Will take another look at this logic.

Contributor Author

?

Contributor Author

The usual logic is to look for the end tag </rootTag>; I added a check for the self-closing form (/>) as well.
Each time, we check for both of these cases.
If you have any suggestions for this logic, I think it is better to handle them in a refactoring in the next version.
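
The two cases described here (an explicit end tag, or a start tag that closes itself) can be illustrated on a plain string. This is a sketch only: the PR's `checkEmptyTag` works one character and position at a time, whereas the hypothetical `findRowEnd` below scans a whole string, and it ignores nesting, tag-name prefixes, and `>` inside attribute values.

```scala
object RowEnd {
  // Hypothetical helper: returns the index just past the first `rowTag`
  // element in `xml`, or None if no complete element is found.
  def findRowEnd(xml: String, rowTag: String): Option[Int] = {
    val start = xml.indexOf("<" + rowTag)
    // Position of the '>' that closes the start tag, or -1 if absent.
    val gt = if (start < 0) -1 else xml.indexOf('>', start)
    if (gt < 0) {
      None
    } else if (xml.charAt(gt - 1) == '/') {
      // Case 1: the start tag is self-closing, so the row ends right here.
      Some(gt + 1)
    } else {
      // Case 2: look for the explicit end tag after the start tag.
      val endTag = "</" + rowTag + ">"
      val end = xml.indexOf(endTag, gt)
      if (end < 0) None else Some(end + endTag.length)
    }
  }
}
```

Both `<ROW a="1"/>` and `<ROW>x</ROW>` are 12 characters long, so `findRowEnd` returns `Some(12)` for either input with `rowTag = "ROW"`.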

Member

Yeah, got it, I simply meant I was double-checking :).

BioQwer and others added 4 commits April 4, 2018 12:51
add test case
pump version

Signed-off-by: Anton Alexandrov <[email protected]>
Signed-off-by: Antony Alexandrov <[email protected]>
Signed-off-by: Antony Alexandrov <[email protected]>
@BioQwer
Contributor Author

BioQwer commented Apr 4, 2018

I fixed it.

@belenaj

belenaj commented Apr 4, 2018

Hi BioQwer,
with your release v0.4.2_empty_prefix, I still can't make it work for my XMLs (please check PR #291)

Any ideas so far?
Thanks!

@BioQwer
Contributor Author

BioQwer commented Apr 4, 2018

@jbelenag Hello!
I answered your issue.
@HyukjinKwon this PR doesn't fix #291

@HyukjinKwon
Member

You are right. I rushed to take a look.

@BioQwer
Contributor Author

BioQwer commented Apr 5, 2018

@HyukjinKwon what about merging this PR? :)

@BioQwer
Contributor Author

BioQwer commented Apr 24, 2018

@HyukjinKwon what happened?
I made the fix and addressed the review comments.
Many people need this fix.
Now my changes have conflicts with master :(

@HyukjinKwon
Member

Guys, sorry for the late response. I manually resolved the conflicts and opened #303 with his commit. It's a pretty core fix, so I had to be very careful before merging it in.
