This repository was archived by the owner on Mar 24, 2025. It is now read-only.

Double closing tags are created for self-closing elements when using a custom schema with StringType #241

@jzhao415

Hey,

I recently encountered a bug when using a custom schema to define some elements as StringType. If a child element is a self-closing tag, the string value of the parent element is changed so that the self-closing tag becomes an opening tag followed by two closing tags, which breaks the XML format.

partial input xml:

<title>XML Developer's Guide</title>Computer44.952000-10-01<title>XML Developer's Guide</title>.....

Notice that the stringcontent element contains self-closing tags, including <publish_date/>.

This is the custom schema:

import org.apache.spark.sql.types._

// stringcontent is declared as StringType, so its nested XML is kept as a raw string
val customSchema = StructType(Array(
  StructField("_id", StringType, nullable = true),
  StructField("stringcontent", StringType, nullable = true),
  StructField("description", StringType, nullable = true),
  StructField("genre", StringType, nullable = true),
  StructField("price", DoubleType, nullable = true),
  StructField("publish_date", StringType, nullable = true),
  StructField("title", StringType, nullable = true)))

First row of the Spark DataFrame:
res6: org.apache.spark.sql.Row = [null,<title>XML Developer's Guide</title>Computer44.95<publish_date>2000-10-01</publish_date><publish_date></publish_date></publish_date>,An in-depth look at creating applications with XML.This manual describes Oracle XML DB, and how you can use it to store, generate, manipulate, manage, and query XML data in the database. After introducing you to the heart of Oracle XML DB, namely the XMLType framework and Oracle XML DB repository, the manual provides a brief introduction to design criteria to consider when planning your Oracle XML DB application. It provides examples of how and where you can use Oracle XML DB. The manual then describes ways you can store and retrieve XML data using Oracle XML DB, AP...

Notice that the self-closing tag has been changed into an opening tag followed by two closing tags.
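As a quick way to confirm that the returned string is no longer well-formed, here is a minimal sketch that wraps a fragment in a dummy root element and tries to parse it with the JDK's built-in XML parser. The object name `WellFormedCheck`, the helper `isWellFormedFragment`, and the `<root>` wrapper are hypothetical names introduced only for this illustration. The `observed` string is the tail of the row value reported above. The `expected` string is an assumption about what a correct result would look like, not output from spark-xml.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource
import scala.util.Try

// Hypothetical helper object, not part of spark-xml.
object WellFormedCheck {
  // Returns true if `fragment` parses as XML once wrapped in a dummy <root> element.
  // The JDK parser throws a SAXParseException on malformed input, which Try captures.
  def isWellFormedFragment(fragment: String): Boolean = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val wrapped = "<root>" + fragment + "</root>"
    Try(builder.parse(new InputSource(new StringReader(wrapped)))).isSuccess
  }

  // Assumed shape of a correct value: the self-closing tag is preserved.
  val expected = "<publish_date>2000-10-01</publish_date><publish_date/>"

  // Tail of the value reported in this issue: one opening tag, two closing tags.
  val observed =
    "<publish_date>2000-10-01</publish_date><publish_date></publish_date></publish_date>"
}
```

Calling `WellFormedCheck.isWellFormedFragment` on `expected` should succeed, while calling it on `observed` should fail because the extra `</publish_date>` has no matching opening tag. The default parser may also print a fatal-error line to stderr for the malformed input.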

books.txt

Versions:
Spark 1.5.1
Scala 2.10.4
spark-xml 2.10-0.3.4
