Description
On a platform I work on, I decided to write Avro log files so I could easily close and append binary files to S3. Since I didn't want to bother transforming them to another format with Spark, which is the tool I wanted to drop in the first place, I started writing what's required to read Avro as a data source in DataFusion.
Here is the branch on my fork (I merged the nested-field PR into it, but it can be removed):
https://github.com/Igosuki/arrow-datafusion/tree/avro2_m
I transformed all the Parquet test files to Avro and plan to add a test case for each of them.
My question is: is Avro support desirable for DataFusion, or should I just make a sidecar crate on my own?
Describe alternatives you've considered
Transforming the data to JSON or Parquet to reuse the existing code.
Additional context
I'm new to the Arrow data types, and it's been a challenge to figure out what to do with Avro union types that are really just a nullable field. Ultimately I decided to map them to nullable fields and drop the union, but I had to add special cases here and there because of that.
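To illustrate the mapping above, here is a minimal sketch of collapsing an Avro `["null", T]` union into a nullable field. It uses a toy schema enum rather than the real avro-rs or Arrow types, and all names (`AvroSchema`, `to_nullable`) are illustrative, not the actual API in the branch:

```rust
// Toy stand-in for an Avro schema (the real one lives in the avro-rs crate).
#[derive(Debug, Clone, PartialEq)]
enum AvroSchema {
    Null,
    Long,
    String,
    Union(Vec<AvroSchema>),
}

// Collapse a union of ["null", T] into (T, nullable = true).
// Any other schema is passed through unchanged as non-nullable;
// multi-variant unions are the "special cases" mentioned above.
fn to_nullable(schema: &AvroSchema) -> (AvroSchema, bool) {
    if let AvroSchema::Union(variants) = schema {
        let non_null: Vec<&AvroSchema> = variants
            .iter()
            .filter(|v| **v != AvroSchema::Null)
            .collect();
        let has_null = non_null.len() != variants.len();
        // Exactly one non-null variant: drop the union wrapper.
        if non_null.len() == 1 {
            return (non_null[0].clone(), has_null);
        }
    }
    (schema.clone(), false)
}

fn main() {
    let s = AvroSchema::Union(vec![AvroSchema::Null, AvroSchema::Long]);
    let (inner, nullable) = to_nullable(&s);
    println!("{:?} nullable={}", inner, nullable);
}
```

The resulting `(type, nullable)` pair maps directly onto an Arrow `Field`, whose constructor takes a name, a data type, and a nullability flag.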