Hey,
I recently encountered a bug when using a custom schema to define some elements as StringType. If a child element is a self-closing tag, the string value of the parent element is changed to an opening tag followed by two closing tags, which breaks the XML format.
Partial input XML:
<title>XML Developer's Guide</title>Computer44.952000-10-01<title>XML Developer's Guide</title>.....
Notice that the stringcontent tag contains self-closing child tags, such as <publish_date/>.
This is the custom schema:
import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("_id", StringType, nullable = true),
  StructField("stringcontent", StringType, nullable = true),
  StructField("description", StringType, nullable = true),
  StructField("genre", StringType, nullable = true),
  StructField("price", DoubleType, nullable = true),
  StructField("publish_date", StringType, nullable = true),
  StructField("title", StringType, nullable = true)))
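For anyone trying to reproduce this, the load step probably looks roughly like the sketch below. It is meant for spark-shell, where sqlContext is already provided, and it reuses the customSchema defined above. The row tag "book" and the file location "books.txt" are assumptions on my part, not values taken from the actual job, so adjust them to match the data.

```scala
// Sketch only: "book" (row tag) and "books.txt" (path) are assumed values.
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")   // assumed name of the element that forms one row
  .schema(customSchema)       // the custom schema defined above
  .load("books.txt")          // assumed location of the attached sample file

// Inspect the first row, including the stringcontent column.
df.first()
```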
First row of the resulting Spark DataFrame:
res6: org.apache.spark.sql.Row = [null,<title>XML Developer's Guide</title>Computer44.95<publish_date>2000-10-01</publish_date><publish_date></publish_date></publish_date>,An in-depth look at creating applications with XML.This manual describes Oracle XML DB, and how you can use it to store, generate, manipulate, manage, and query XML data in the database. After introducing you to the heart of Oracle XML DB, namely the XMLType framework and Oracle XML DB repository, the manual provides a brief introduction to design criteria to consider when planning your Oracle XML DB application. It provides examples of how and where you can use Oracle XML DB. The manual then describes ways you can store and retrieve XML data using Oracle XML DB, AP...
Notice that the self-closing tag is changed to an opening tag and two closing tags.
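To make the breakage concrete, here is a small sketch that checks whether a fragment is still well-formed XML. It uses only the JDK's built-in parser and does not touch spark-xml. The object name WellFormedCheck and the helper isWellFormed are made up for illustration, and the second fragment is the tail of the stringcontent value from the row above.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.{InputSource, SAXException}
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper, not part of spark-xml.
object WellFormedCheck {
  // Wraps the fragment in a dummy root element so that a sequence of
  // sibling elements and text can be parsed as a single document.
  def isWellFormed(fragment: String): Boolean = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    // DefaultHandler rethrows fatal errors without printing them to stderr.
    builder.setErrorHandler(new DefaultHandler)
    try {
      builder.parse(new InputSource(new StringReader("<root>" + fragment + "</root>")))
      true
    } catch {
      case _: SAXException => false
    }
  }

  def main(args: Array[String]): Unit = {
    // Self-closing tag as it appears in the input.
    println(isWellFormed("<publish_date/>"))                              // prints true
    // What the StringType column contains after loading.
    println(isWellFormed("<publish_date></publish_date></publish_date>")) // prints false
  }
}
```

The extra closing tag has no matching opening tag, so any downstream XML parser that receives the stringcontent value will reject it.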
books.txt
version: spark 1.5.1
scala 2.10.4
spark-xml 2.10-0.3.4