attribute of xml file schema inferred as either array or not #75

@minnieshi

I am using version 1.5.3 to ingest XML files into Kafka.

Maybe this is a question rather than an issue.

This is related to, but different from, issue #74.
The fix for that issue starts reading the attributes of the XML file, but the resulting inferred schema can differ depending on whether the element carrying the attribute occurs only once or more than once.

The situation now is that I have 2 XML files, and the resulting messages show that the inferred schema (and hence the payload format) differs between them.

I do not know whether I can force the attribute to be an array.

Below is the content of file 1

<?xml version="1.0" encoding="UTF-8"?>
<data>
    <field1>field1Value</field1>
    <field2>
        <value attributeWantToBefield="attribute1Value">27</value>
    </field2>
    <field4>2020-08-01T18:00:00</field4>
</data>

Below is the content of file 2

<?xml version="1.0" encoding="UTF-8"?>
<data>
    <field1>field1Value</field1>
    <field2>
        <value attributeWantToBefield="attribute1Value">25</value>
        <value attributeWantToBefield="attribute2Value">77</value>
    </field2>
    <field4>1919-08-02T18:02:00</field4>
</data>
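
To make the difference between the two files concrete, here is a minimal sketch that reads field2 with the Python standard library and always collects the value elements into a list, whether there is one or several. This is not the connector's inference code; the function name read_values is hypothetical, and the XML declaration is omitted from the inline samples for brevity.

```python
import xml.etree.ElementTree as ET

# Inline copies of the two sample files from this issue.
FILE1 = """<data>
    <field1>field1Value</field1>
    <field2>
        <value attributeWantToBefield="attribute1Value">27</value>
    </field2>
    <field4>2020-08-01T18:00:00</field4>
</data>"""

FILE2 = """<data>
    <field1>field1Value</field1>
    <field2>
        <value attributeWantToBefield="attribute1Value">25</value>
        <value attributeWantToBefield="attribute2Value">77</value>
    </field2>
    <field4>1919-08-02T18:02:00</field4>
</data>"""


def read_values(xml_text):
    """Return every <value> under <field2> as a list of dicts (hypothetical helper)."""
    root = ET.fromstring(xml_text)  # root is the <data> element
    return [
        {
            "attributeWantToBefield": v.get("attributeWantToBefield"),
            "value": v.text,
        }
        for v in root.findall("./field2/value")
    ]


if __name__ == "__main__":
    print(read_values(FILE1))  # a list with one entry
    print(read_values(FILE2))  # a list with two entries
```

Because findall always returns a list, both files produce the same shape here. An inference step that looks at a single document cannot know that value may repeat, which is presumably why file 1 ends up as a struct.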

From the attached files you can see the issue: the payload has a different format in the two messages, which makes parsing the resulting JSON difficult.

payload file 1 (not array)

"payload": {
  "data": {
    "field1": "field1Value",
    "field2": {
      "value": {
        "attributeWantToBefield": "attribute1Value",
        "value": "27"
      }
    },
    "field4": "2020-08-01T18:00:00"
  }
}

payload file 2 (array for the attribute part)

"payload": {
  "data": {
    "field1": "field1Value",
    "field2": {
      "value": [
        {
          "attributeWantToBefield": "attribute1Value",
          "value": "25"
        },
        {
          "attributeWantToBefield": "attribute2Value",
          "value": "77"
        }
      ]
    },
    "field4": "1919-08-02T18:02:00"
  }
}
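
If the schema cannot be forced on the connector side, one possible workaround is to normalise the payload in the consumer so that field2.value is always a list. The following is only a sketch under that assumption; the helper names as_list and normalize_payload are hypothetical and are not part of the connector.

```python
import json


def as_list(x):
    """Wrap a single object in a list; leave lists as they are."""
    if x is None:
        return []
    return x if isinstance(x, list) else [x]


def normalize_payload(message):
    """Make payload.data.field2.value a list in both message shapes (hypothetical helper)."""
    field2 = message["payload"]["data"]["field2"]
    field2["value"] = as_list(field2.get("value"))
    return message


# Payload shaped like the message for file 1 (value is a single object).
MESSAGE1 = json.loads("""
{
  "payload": {
    "data": {
      "field1": "field1Value",
      "field2": {
        "value": {
          "attributeWantToBefield": "attribute1Value",
          "value": "27"
        }
      },
      "field4": "2020-08-01T18:00:00"
    }
  }
}
""")

if __name__ == "__main__":
    normalized = normalize_payload(MESSAGE1)
    print(json.dumps(normalized["payload"]["data"]["field2"], indent=2))
```

This only hides the difference after the fact. The two messages would still carry different inferred schemas, so it does not help a sink that relies on the schema itself.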

Attached are the resulting messages: one file contains the unformatted messages for the two XML files combined, and the other 2 files contain the formatted messages.
result.formatted_fiel1.json.txt
result.formatted_file2.json.txt
result_to_reformat.json.txt

From the inferred schema you can see that value is inferred as a struct type in one case and as an array in the other.
image
