[SPARK-48939][AVRO] Support reading Avro with recursive schema reference #47425
Conversation
WweiL
left a comment
Thanks! Left some comments
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
```scala
 * Adds support for recursive fields. If this option is not specified or is set to 0, recursive
 * fields are not permitted. Setting it to 1 drops all recursive fields, 2 allows recursive
 * fields to be recursed once, and 3 allows it to be recursed twice and so on, up to 15.
 * Values larger than 15 are not allowed in order to avoid inadvertently creating very large schemas.
```
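To make the depth semantics concrete, here is a hedged sketch (not taken from the PR's code or tests) of the Spark schemas one would expect for the linked-list LongList schema from the PR description, under the semantics quoted above:

```scala
import org.apache.spark.sql.types._

object RecursiveDepthExamples {
  // Expected unrolled shapes for LongList(value: long, next: LongList); illustrative
  // only, not output copied from a real run.

  // recursiveFieldMaxDepth = 1: the recursive `next` field is dropped entirely.
  val depth1: StructType = StructType(Seq(StructField("value", LongType)))

  // recursiveFieldMaxDepth = 2: `next` is kept once; its own `next` is dropped.
  val depth2: StructType = StructType(Seq(
    StructField("value", LongType),
    StructField("next", depth1)))

  // recursiveFieldMaxDepth = 3: one more level of nesting.
  val depth3: StructType = StructType(Seq(
    StructField("value", LongType),
    StructField("next", depth2)))
}
```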
Does Protobuf also have a max depth of 15?
I feel this should be a Spark conf.
Protobuf has a max depth of 10, and it is hardcoded. 15 is used because some users have asked for depths up to 12, and 3 more is given as a buffer. I agree that it would be better if users could increase the max depth at will; since Protobuf does not support that either, such a config can be added in a future PR for both Protobuf and Avro.
connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
```scala
  false,
  "")
  "",
  -1)
```
Should we just add a default value in the definition to prevent multiple API changes?
Yeah, I have thought about it. I did not add a default value for two reasons. First, some newly added options (stableIdPrefixForUnionType) did not specify a default value either. Second, there are two constructors for the class, one with 5 arguments and the other with 7; if we added default values to both of them, there would be a clash of definitions, which is confusing.
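For context, a minimal, self-contained Scala sketch (illustrative class and parameter names, not the actual Spark ones) of why defaults on both overloaded constructors would clash: Scala 2, which Spark builds against, forbids default arguments on more than one overloaded alternative.

```scala
// Only one constructor may declare the default; the secondary one must pass it explicitly.
class Options(
    positionalFieldMatch: Boolean,
    stableIdPrefixForUnionType: String,
    recursiveFieldMaxDepth: Int = -1) {

  // Declaring `recursiveFieldMaxDepth: Int = -1` here as well would fail to compile
  // ("multiple overloaded alternatives ... define default arguments").
  def this(positionalFieldMatch: Boolean) =
    this(positionalFieldMatch, "", -1)
}

object Options {
  val viaPrimary = new Options(false, "member_")  // uses the default -1
  val viaSecondary = new Options(false)           // -1 passed explicitly inside
}
```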
connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
```scala
        throw new IncompatibleSchemaException(s"""
          |Found recursive reference in Avro schema, which can not be processed by Spark:
          |${avroSchema.toString(true)}
          |Found recursive reference in Avro schema, which can not be processed by Spark by
```
IIRC Protobuf does something similar here. But this logic looks a bit weird. If we do want to limit the max recursive depth, I feel it should be checked in the option and throw an IllegalArgumentException.
Good idea.
| """.stripMargin) | ||
| } | ||
|
|
||
| private def checkSparkSchemaEquals( |
Can we also have some boundary checks (< 0, > 15)?
The value can be negative. I will add a test for the > 15 case.
I put it in the integration test to minimize code.
```scala
  val recursiveFieldMaxDepth: Int =
    parameters.get(RECURSIVE_FIELD_MAX_DEPTH).map(_.toInt).getOrElse(-1)

  if (recursiveFieldMaxDepth > 15) {
```
Is 15 the max depth we allow? Can we use a constant like RECURSIVE_FIELD_DEPTH_LIMIT to represent it?
```scala
    parameters.get(RECURSIVE_FIELD_MAX_DEPTH).map(_.toInt).getOrElse(-1)

  if (recursiveFieldMaxDepth > 15) {
    throw new IllegalArgumentException(s"Valid range of $RECURSIVE_FIELD_MAX_DEPTH is 0 - 15.")
```
Can we follow the error class strategy to classify this error? You can refer to something like this, but consider creating your own type.
Sure.
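As a rough illustration of the error-class suggestion (a self-contained sketch, not Spark's actual error framework and not the final error class name used by the PR):

```scala
// The error carries a stable error-class identifier plus named message parameters,
// instead of a free-form string baked into the exception message.
class AvroOptionsException(
    val errorClass: String,
    val messageParameters: Map[String, String])
  extends IllegalArgumentException(
    s"[$errorClass] " + messageParameters.map { case (k, v) => s"$k=$v" }.mkString(", "))

object AvroOptionsException {
  // Hypothetical factory mirroring the range check discussed above.
  def invalidRecursiveFieldMaxDepth(option: String, value: Int, limit: Int): AvroOptionsException =
    new AvroOptionsException(
      errorClass = "AVRO_OPTIONS.INVALID_RECURSIVE_FIELD_MAX_DEPTH",
      messageParameters = Map(
        "option" -> option,
        "value" -> value.toString,
        "allowedRange" -> s"0 to $limit"))
}
```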
connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
WweiL
left a comment
LGTM!
bogao007
left a comment
LGTM overall, thanks for adding this support! Left a comment regarding the doc.
```scala
      }
    } else if (recursiveDepth > 0 && recursiveDepth >= recursiveFieldMaxDepth) {
      logInfo(
        log"The field ${MDC(FIELD_NAME, avroSchema.getFullName)} of type " +
```
What's the behavior for Protobuf? Do we drop the fields or do we throw errors?
Protobuf also drops the field and logs the action.
```html
  <td>read</td>
  <td>4.0.0</td>
</tr>
<tr>
```
Thanks for adding the doc for the new option! Apart from this, should we add a block on recursion support similar to what we do for protobuf?
Good idea.
bogao007
left a comment
LGTM
@cloud-fan Could you please review it? Thanks!

cc @HeartSaVioR and @rangadi

@gengliangwang Would you mind helping review the change, as you've been one of the main reviewers for Avro? I can give it a try, but I don't feel like I'm qualified to review and sign off.

Friendly reminder, @gengliangwang

@HeartSaVioR Thanks for the ping. I will find time to review this one soon.
```scala
    messageParameters,
    cause)

object AvroOptionsError {
```
Let's move this to QueryCompilationErrors. There are some Avro errors in it.
cc @hkulyc
```scala
      stableIdPrefixForUnionType: String): SchemaType = {
    toSqlTypeHelper(avroSchema, Set.empty, useStableIdForUnionType, stableIdPrefixForUnionType)
      stableIdPrefixForUnionType: String,
      recursiveFieldMaxDepth: Int = -1): SchemaType = {
```
Let's add a code comment for each parameter and explain that -1 means not supported.
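A hedged sketch of what that per-parameter documentation might look like (the wording and the local SchemaType stand-in are illustrative, not the exact text merged):

```scala
object SchemaConvertersSketch {
  import org.apache.avro.Schema
  import org.apache.spark.sql.types.DataType

  // Local stand-in mirroring SchemaConverters.SchemaType, for the sketch only.
  case class SchemaType(dataType: DataType, nullable: Boolean)

  /**
   * Converts an Avro schema to a Spark SQL schema.
   *
   * @param avroSchema the Avro schema to convert
   * @param useStableIdForUnionType whether union member fields get stable, type-based names
   * @param stableIdPrefixForUnionType prefix prepended to stable union member names
   * @param recursiveFieldMaxDepth how deep recursive fields are unrolled;
   *                               -1 (the default) means recursive references are not supported
   */
  def toSqlType(
      avroSchema: Schema,
      useStableIdForUnionType: Boolean,
      stableIdPrefixForUnionType: String,
      recursiveFieldMaxDepth: Int = -1): SchemaType = ???
}
```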
```scala
        else {
          StructField(f.name, schemaType.dataType, schemaType.nullable)
        }
      }.filter(_ != null).toSeq
```
Don't we need to keep the null fields?
We need to drop them because StructType does not accept an array that contains null values. One alternative is to wrap the null value in a StructField(..., nullable = true), but we do not do that for Protobuf either; in Protobuf we just drop them directly.
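A minimal sketch of the drop-and-filter behavior described above (hypothetical helper names; the real logic lives in SchemaConverters):

```scala
import org.apache.spark.sql.types._

object DropRecursiveFields {
  // Hypothetical converted-field representation: dataType is null when the field
  // was dropped (e.g. a recursive reference beyond recursiveFieldMaxDepth).
  final case class Converted(name: String, dataType: DataType, nullable: Boolean)

  // StructType cannot hold null entries, so dropped fields are filtered out
  // rather than wrapped in a nullable StructField.
  def toStructType(fields: Seq[Converted]): StructType =
    StructType(fields.collect {
      case f if f.dataType != null => StructField(f.name, f.dataType, f.nullable)
    })
}
```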
```scala
          } else {
            s"member$i"
          }
        val fieldName = if (useStableIdForUnionType) {
```
Do we have a test case for this code branch?
If you are talking about useStableIdForUnionType, the code was originally added by another PR: #44964. And I believe the test is already included in that PR.
This PR only wraps this block with a condition checking whether the child type is null or not.
I meant the recursive union schema with more than 2 non-null fields.
Recursive and non-recursive union schemas share this code branch. For non-recursive schemas, we have this test that covers this branch:
spark/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroFunctionsSuite.scala
Line 290 in 3305939
| test("SPARK-48545: from_avro and to_avro SQL functions") { |
@hkulyc yes, what about the recursive union schema?
For example
```json
{
  "type": "record",
  "name": "TreeNode",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "value",
      "type": ["string", "int"]
    },
    {
      "name": "children",
      "type": [
        "null",
        {
          "type": "array",
          "items": "TreeNode"
        }
      ],
      "default": null
    }
  ]
}
```
Do we have a test case for that?
We have a test case for a simplified version of this:
| test("Translate recursive schema - union") { |
Do you think it is sufficient? Thanks for pointing this out. @gengliangwang
Do you think it is sufficient?
No, it doesn't cover the case where the number of non-null fields is > 1.
Continue the discussion from #47425 to this PR because I can't push to Yuchen's account.

### What changes were proposed in this pull request?
The builtin ProtoBuf connector was the first to support recursive schema references. It is approached by letting users specify an option "recursive.fields.max.depth"; at the start of execution, the recursive fields are unrolled to that level. This converts a problem of dynamic per-row schemas into a fixed schema, which Spark supports. Avro can adopt a similar method. This PR defines an option "recursiveFieldMaxDepth" for both the Avro data source and the from_avro function. With this option, Spark can support Avro recursive schemas up to a certain depth.

### Why are the changes needed?
Recursive reference denotes the case where the type of a field has been defined before in one of the parent nodes. A simple example is:
```
{
  "type": "record",
  "name": "LongList",
  "fields" : [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
```
This is written in the Avro schema DSL and represents a linked-list data structure. Spark currently throws an error on this schema. Many users use schemas like this, so we should support it.

### Does this PR introduce any user-facing change?
Yes. Previously, Spark would throw an error on recursive schemas like the one above. With this change, it still throws the same error by default, but when users set the option to a number greater than 0, the schema is unrolled to that depth.

### How was this patch tested?
Added new unit tests and integration tests to AvroSuite and AvroFunctionsSuite.

### Was this patch authored or co-authored using generative AI tooling?
No.

Co-authored-by: Wei Liu <wei.liudatabricks.com>

Closes #48043 from WweiL/yuchen-avro-recursive-schema.

Lead-authored-by: Yuchen Liu <[email protected]>
Co-authored-by: Wei Liu <[email protected]>
Co-authored-by: Yuchen Liu <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
With this one merged #48043
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
The builtin ProtoBuf connector was the first to support recursive schema references. It is approached by letting users specify an option "recursive.fields.max.depth"; at the start of execution, the recursive fields are unrolled to that level. This converts a problem of dynamic per-row schemas into a fixed schema, which Spark supports. Avro can adopt a similar method. This PR defines an option "recursiveFieldMaxDepth" for both the Avro data source and the from_avro function. With this option, Spark can support Avro recursive schemas up to a certain depth.
Why are the changes needed?
Recursive reference denotes the case where the type of a field has been defined before in one of the parent nodes. A simple example is:
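```
{
  "type": "record",
  "name": "LongList",
  "fields" : [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
```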
This is written in the Avro schema DSL and represents a linked-list data structure. Spark currently throws an error on this schema. Many users use schemas like this, so we should support it.
Does this PR introduce any user-facing change?
Yes. Previously, Spark would throw an error on recursive schemas like the one above. With this change, it still throws the same error by default, but when users set the option to a number greater than 0, the schema is unrolled to that depth.
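For illustration, a hedged usage sketch of the new option (the input path is hypothetical; the option name and semantics come from this description):

```scala
import org.apache.spark.sql.SparkSession

object ReadRecursiveAvro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("recursive-avro").getOrCreate()

    // With recursiveFieldMaxDepth = 2, the LongList schema above is read as
    // struct<value: long, next: struct<value: long>>; with the option unset
    // (or 0), Spark keeps throwing the existing recursive-reference error.
    val df = spark.read
      .format("avro")
      .option("recursiveFieldMaxDepth", 2)
      .load("/path/to/longlist.avro")  // hypothetical input path

    df.printSchema()
    spark.stop()
  }
}
```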
How was this patch tested?
Added new unit tests and integration tests to AvroSuite and AvroFunctionsSuite.
Was this patch authored or co-authored using generative AI tooling?
No.