[SPARK-48939][AVRO] Support reading Avro with recursive schema reference #47425
Conversation
WweiL
left a comment
Thanks! Left some comments
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
```scala
 * Adds support for recursive fields. If this option is not specified or is set to 0, recursive
 * fields are not permitted. Setting it to 1 drops all recursive fields, 2 allows recursive
 * fields to be recursed once, and 3 allows it to be recursed twice and so on, up to 15.
 * Values larger than 15 are not allowed in order to avoid inadvertently creating very large schemas.
```
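To make the depth semantics concrete, here is a hedged sketch (not taken from the PR's code or tests) of the Spark schemas one would expect for the linked-list LongList schema from the PR description, under the semantics quoted above:

```scala
import org.apache.spark.sql.types._

object RecursiveDepthExamples {
  // Expected unrolled shapes for LongList(value: long, next: LongList); illustrative
  // only, not output copied from a real run.

  // recursiveFieldMaxDepth = 1: the recursive `next` field is dropped entirely.
  val depth1: StructType = StructType(Seq(StructField("value", LongType)))

  // recursiveFieldMaxDepth = 2: `next` is kept once; its own `next` is dropped.
  val depth2: StructType = StructType(Seq(
    StructField("value", LongType),
    StructField("next", depth1)))

  // recursiveFieldMaxDepth = 3: one more level of nesting.
  val depth3: StructType = StructType(Seq(
    StructField("value", LongType),
    StructField("next", depth2)))
}
```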
Does Protobuf also have a max depth of 15?
I feel this should be a Spark conf.
Protobuf has a max depth of 10, and it is hardcoded. 15 is used because some users have asked for depths up to 12, and 3 more is given as a buffer. I agree that it would be better if users could increase the max depth at will; since Protobuf does not support that either, such a config can be added in a future PR for both Protobuf and Avro.
connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
```scala
  false,
  "")
  "",
  -1)
```
Should we just add a default value in the definition to prevent multiple API changes?
Yeah, I have thought about it. I did not add a default value for two reasons. First, some newly added options (stableIdPrefixForUnionType) did not specify a default value either. Second, there are two constructors for the class, one with 5 arguments and the other with 7; if we added default values to both of them, there would be a clash of definitions, which is confusing.
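For context, a minimal, self-contained Scala sketch (illustrative class and parameter names, not the actual Spark ones) of why defaults on both overloaded constructors would clash: Scala 2, which Spark builds against, forbids default arguments on more than one overloaded alternative.

```scala
// Only one constructor may declare the default; the secondary one must pass it explicitly.
class Options(
    positionalFieldMatch: Boolean,
    stableIdPrefixForUnionType: String,
    recursiveFieldMaxDepth: Int = -1) {

  // Declaring `recursiveFieldMaxDepth: Int = -1` here as well would fail to compile
  // ("multiple overloaded alternatives ... define default arguments").
  def this(positionalFieldMatch: Boolean) =
    this(positionalFieldMatch, "", -1)
}

object Options {
  val viaPrimary = new Options(false, "member_")  // uses the default -1
  val viaSecondary = new Options(false)           // -1 passed explicitly inside
}
```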
connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala
```scala
        throw new IncompatibleSchemaException(s"""
          |Found recursive reference in Avro schema, which can not be processed by Spark:
          |${avroSchema.toString(true)}
          |Found recursive reference in Avro schema, which can not be processed by Spark by
```
IIRC Protobuf does something similar here. But this logic looks a bit weird. If we do want to limit the max recursive depth, I feel it should be checked in the option and throw an IllegalArgumentException.
Good idea.
| """.stripMargin) | ||
| } | ||
|
|
||
| private def checkSparkSchemaEquals( |
Can we also have some boundary checks (< 0, > 15)?
The value can be negative. I will add a test for the > 15 case.
I put it in the integration test to minimize code.
```scala
  val recursiveFieldMaxDepth: Int =
    parameters.get(RECURSIVE_FIELD_MAX_DEPTH).map(_.toInt).getOrElse(-1)

  if (recursiveFieldMaxDepth > 15) {
```
Is 15 the max depth we allow? Can we use a constant like RECURSIVE_FIELD_DEPTH_LIMIT to represent it?
```scala
    parameters.get(RECURSIVE_FIELD_MAX_DEPTH).map(_.toInt).getOrElse(-1)

  if (recursiveFieldMaxDepth > 15) {
    throw new IllegalArgumentException(s"Valid range of $RECURSIVE_FIELD_MAX_DEPTH is 0 - 15.")
```
Can we follow the error class strategy to classify this error? You can refer to something like this, but consider creating your own type.
Sure.
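As a rough illustration of the error-class suggestion (a self-contained sketch, not Spark's actual error framework and not the final error class name used by the PR):

```scala
// The error carries a stable error-class identifier plus named message parameters,
// instead of a free-form string baked into the exception message.
class AvroOptionsException(
    val errorClass: String,
    val messageParameters: Map[String, String])
  extends IllegalArgumentException(
    s"[$errorClass] " + messageParameters.map { case (k, v) => s"$k=$v" }.mkString(", "))

object AvroOptionsException {
  // Hypothetical factory mirroring the range check discussed above.
  def invalidRecursiveFieldMaxDepth(option: String, value: Int, limit: Int): AvroOptionsException =
    new AvroOptionsException(
      errorClass = "AVRO_OPTIONS.INVALID_RECURSIVE_FIELD_MAX_DEPTH",
      messageParameters = Map(
        "option" -> option,
        "value" -> value.toString,
        "allowedRange" -> s"0 to $limit"))
}
```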
connector/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala
WweiL
left a comment
LGTM!
bogao007
left a comment
LGTM overall, thanks for adding this support! Left a comment regarding the doc.
```scala
      }
    } else if (recursiveDepth > 0 && recursiveDepth >= recursiveFieldMaxDepth) {
      logInfo(
        log"The field ${MDC(FIELD_NAME, avroSchema.getFullName)} of type " +
```
What's the behavior for Protobuf? Do we drop the fields or do we throw errors?
Protobuf also drops the field and logs the action.
```html
  <td>read</td>
  <td>4.0.0</td>
</tr>
<tr>
```
Thanks for adding the doc for the new option! Apart from this, should we add a block on recursion support similar to what we do for protobuf?
Good idea.
bogao007
left a comment
LGTM
@cloud-fan Could you please review it? Thanks!

cc @HeartSaVioR and @rangadi

@gengliangwang Would you mind helping review the change, as you've been one of the main reviewers for Avro? I can give it a try, but I don't feel like I'm qualified to review and sign off.

Friendly reminder, @gengliangwang

@HeartSaVioR Thanks for the ping. I will find time to review this one soon.
```scala
    messageParameters,
    cause)

object AvroOptionsError {
```
Let's move this to QueryCompilationErrors. There are some Avro errors in it.
cc @hkulyc
```scala
      stableIdPrefixForUnionType: String): SchemaType = {
    toSqlTypeHelper(avroSchema, Set.empty, useStableIdForUnionType, stableIdPrefixForUnionType)
      stableIdPrefixForUnionType: String,
      recursiveFieldMaxDepth: Int = -1): SchemaType = {
```
Let's add a code comment for each parameter and explain that -1 means not supported.
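A hedged sketch of what that per-parameter documentation might look like (the wording and the local SchemaType stand-in are illustrative, not the exact text merged):

```scala
object SchemaConvertersSketch {
  import org.apache.avro.Schema
  import org.apache.spark.sql.types.DataType

  // Local stand-in mirroring SchemaConverters.SchemaType, for the sketch only.
  case class SchemaType(dataType: DataType, nullable: Boolean)

  /**
   * Converts an Avro schema to a Spark SQL schema.
   *
   * @param avroSchema the Avro schema to convert
   * @param useStableIdForUnionType whether union member fields get stable, type-based names
   * @param stableIdPrefixForUnionType prefix prepended to stable union member names
   * @param recursiveFieldMaxDepth how deep recursive fields are unrolled;
   *                               -1 (the default) means recursive references are not supported
   */
  def toSqlType(
      avroSchema: Schema,
      useStableIdForUnionType: Boolean,
      stableIdPrefixForUnionType: String,
      recursiveFieldMaxDepth: Int = -1): SchemaType = ???
}
```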
```scala
        else {
          StructField(f.name, schemaType.dataType, schemaType.nullable)
        }
      }.filter(_ != null).toSeq
```
Don't we need to keep the null fields?
We need to drop them because StructType does not accept an array that contains null values. One alternative is to wrap the null value in a StructField(..., nullable = true), but we do not do that for Protobuf either; in Protobuf we just drop them directly.
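A minimal sketch of the drop-and-filter behavior described above (hypothetical helper names; the real logic lives in SchemaConverters):

```scala
import org.apache.spark.sql.types._

object DropRecursiveFields {
  // Hypothetical converted-field representation: dataType is null when the field
  // was dropped (e.g. a recursive reference beyond recursiveFieldMaxDepth).
  final case class Converted(name: String, dataType: DataType, nullable: Boolean)

  // StructType cannot hold null entries, so dropped fields are filtered out
  // rather than wrapped in a nullable StructField.
  def toStructType(fields: Seq[Converted]): StructType =
    StructType(fields.collect {
      case f if f.dataType != null => StructField(f.name, f.dataType, f.nullable)
    })
}
```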
```scala
          } else {
            s"member$i"
          }
        val fieldName = if (useStableIdForUnionType) {
```
Do we have a test case for this code branch?
If you are talking about useStableIdForUnionType, the code was originally added by another PR: #44964. And I believe the test is already included in that PR.
This PR only wraps this block with a condition checking whether the child type is null or not.
I meant the recursive union schema with more than 2 non-null fields.
Recursive and non-recursive union schemas share this code branch. For non-recursive schemas, we have this test that covers this branch:
spark/connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroFunctionsSuite.scala
Line 290 in 3305939
| test("SPARK-48545: from_avro and to_avro SQL functions") { |
@hkulyc yes, what about the recursive union schema?
For example
```json
{
  "type": "record",
  "name": "TreeNode",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "value",
      "type": ["string", "int"]
    },
    {
      "name": "children",
      "type": [
        "null",
        {
          "type": "array",
          "items": "TreeNode"
        }
      ],
      "default": null
    }
  ]
}
```
Do we have a test case for that?
We have a test case for a simplified version of this:
| test("Translate recursive schema - union") { |
Do you think it is sufficient? Thanks for pointing this out. @gengliangwang
Do you think it is sufficient?
No, it doesn't cover the case where the number of non-null fields is > 1.
Continue the discussion from #47425 to this PR because I can't push to Yuchen's account.

### What changes were proposed in this pull request?
The builtin ProtoBuf connector was the first to support recursive schema references. It is approached by letting users specify an option "recursive.fields.max.depth"; at the start of execution, the recursive fields are unrolled to that level. This converts a problem of dynamic per-row schemas into a fixed schema, which Spark supports. Avro can adopt a similar method. This PR defines an option "recursiveFieldMaxDepth" for both the Avro data source and the from_avro function. With this option, Spark can support Avro recursive schemas up to a certain depth.

### Why are the changes needed?
Recursive reference denotes the case where the type of a field has been defined before in one of the parent nodes. A simple example is:
```
{
  "type": "record",
  "name": "LongList",
  "fields" : [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
```
This is written in the Avro schema DSL and represents a linked-list data structure. Spark currently throws an error on this schema. Many users use schemas like this, so we should support it.

### Does this PR introduce any user-facing change?
Yes. Previously, Spark would throw an error on recursive schemas like the one above. With this change, it still throws the same error by default, but when users set the option to a number greater than 0, the schema is unrolled to that depth.

### How was this patch tested?
Added new unit tests and integration tests to AvroSuite and AvroFunctionsSuite.

### Was this patch authored or co-authored using generative AI tooling?
No.

Co-authored-by: Wei Liu <wei.liudatabricks.com>

Closes #48043 from WweiL/yuchen-avro-recursive-schema.

Lead-authored-by: Yuchen Liu <[email protected]>
Co-authored-by: Wei Liu <[email protected]>
Co-authored-by: Yuchen Liu <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
With this one merged #48043
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
The builtin ProtoBuf connector was the first to support recursive schema references. It is approached by letting users specify an option "recursive.fields.max.depth"; at the start of execution, the recursive fields are unrolled to that level. This converts a problem of dynamic per-row schemas into a fixed schema, which Spark supports. Avro can adopt a similar method. This PR defines an option "recursiveFieldMaxDepth" for both the Avro data source and the from_avro function. With this option, Spark can support Avro recursive schemas up to a certain depth.
Why are the changes needed?
Recursive reference denotes the case where the type of a field has been defined before in one of the parent nodes. A simple example is:
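```
{
  "type": "record",
  "name": "LongList",
  "fields" : [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["null", "LongList"]}
  ]
}
```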
This is written in the Avro schema DSL and represents a linked-list data structure. Spark currently throws an error on this schema. Many users use schemas like this, so we should support it.
Does this PR introduce any user-facing change?
Yes. Previously, Spark would throw an error on recursive schemas like the one above. With this change, it still throws the same error by default, but when users set the option to a number greater than 0, the schema is unrolled to that depth.
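For illustration, a hedged usage sketch of the new option (the input path is hypothetical; the option name and semantics come from this description):

```scala
import org.apache.spark.sql.SparkSession

object ReadRecursiveAvro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("recursive-avro").getOrCreate()

    // With recursiveFieldMaxDepth = 2, the LongList schema above is read as
    // struct<value: long, next: struct<value: long>>; with the option unset
    // (or 0), Spark keeps throwing the existing recursive-reference error.
    val df = spark.read
      .format("avro")
      .option("recursiveFieldMaxDepth", 2)
      .load("/path/to/longlist.avro")  // hypothetical input path

    df.printSchema()
    spark.stop()
  }
}
```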
How was this patch tested?
Added new unit tests and integration tests to AvroSuite and AvroFunctionsSuite.
Was this patch authored or co-authored using generative AI tooling?
No.