encode(..., "hex") errors on non-UTF-8 binaries since DataFusion v43 #14055

@progval

Describe the bug

encode(..., "hex") can be used to get the hexadecimal representation of a string or a binary. Since DataFusion v43 (specifically, since 1b3608d, i.e. #12308), only strings and binaries that happen to be valid UTF-8 are supported.

To Reproduce

vlorentz@maxxi:~/datafusion/datafusion-cli$ git checkout 1b3608da7ca59d8d987804834d004e8b3e349d18
HEAD is now at 1b3608da7 fix: coalesce schema issues (#12308)
vlorentz@maxxi:~/datafusion/datafusion-cli$ cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.27s
     Running `target/debug/datafusion-cli`
DataFusion CLI v42.0.0
> create table test ( foo bytea );
0 row(s) fetched. 
Elapsed 0.007 seconds.

> insert into test (foo) values (X'8f50d3f60eae370ddbf85c86219c55108a350165');
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched. 
Elapsed 0.006 seconds.

> EXPLAIN SELECT encode(foo, 'hex') FROM test;
+---------------+-----------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                    |
+---------------+-----------------------------------------------------------------------------------------+
| logical_plan  | Projection: encode(CAST(test.foo AS Utf8), Utf8("hex"))                                 |
|               |   TableScan: test projection=[foo]                                                      |
| physical_plan | ProjectionExec: expr=[encode(CAST(foo@0 AS Utf8), hex) as encode(test.foo,Utf8("hex"))] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                                         |
|               |                                                                                         |
+---------------+-----------------------------------------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.007 seconds.

> SELECT encode(foo, 'hex') FROM test;
Arrow error: Invalid argument error: Encountered non UTF-8 data: invalid utf-8 sequence of 1 bytes from index 0
> 
\q

Expected behavior

vlorentz@maxxi:~/datafusion/datafusion-cli$ git checkout 1b3608da7ca59d8d987804834d004e8b3e349d18^
Previous HEAD position was 1b3608da7 fix: coalesce schema issues (#12308)
HEAD is now at 9a3f8d115 Minor: Encapsulate type check in GroupValuesColumn, avoid panic (#12620)
vlorentz@maxxi:~/datafusion/datafusion-cli$ cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 53.01s
     Running `target/debug/datafusion-cli`
DataFusion CLI v42.0.0
> create table test ( foo bytea );
0 row(s) fetched. 
Elapsed 0.005 seconds.

> insert into test (foo) values (X'8f50d3f60eae370ddbf85c86219c55108a350165');
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched. 
Elapsed 0.005 seconds.

> EXPLAIN SELECT encode(foo, 'hex') FROM test;
+---------------+---------------------------------------------------------------------------+
| plan_type     | plan                                                                      |
+---------------+---------------------------------------------------------------------------+
| logical_plan  | Projection: encode(test.foo, Utf8("hex"))                                 |
|               |   TableScan: test projection=[foo]                                        |
| physical_plan | ProjectionExec: expr=[encode(foo@0, hex) as encode(test.foo,Utf8("hex"))] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                           |
|               |                                                                           |
+---------------+---------------------------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.005 seconds.

> SELECT encode(foo, 'hex') FROM test;
+------------------------------------------+
| encode(test.foo,Utf8("hex"))             |
+------------------------------------------+
| 8f50d3f60eae370ddbf85c86219c55108a350165 |
+------------------------------------------+
1 row(s) fetched. 
Elapsed 0.004 seconds.

> 
\q

Additional context

Note the CAST(test.foo AS Utf8) in the first query plan, which does not appear in the second one.
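
To illustrate why the extra cast matters, here is a minimal std-only Rust sketch. It is not DataFusion's implementation: `hex_encode` and `hex_encode_via_utf8` are hypothetical helper names, and the second one only mimics the effect of validating the bytes as UTF-8 before encoding, which is what the inserted CAST appears to do.

```rust
use std::str::Utf8Error;

/// Hex-encode arbitrary bytes. No UTF-8 validity is required.
/// (Hypothetical helper, not a DataFusion API.)
fn hex_encode(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Mimics the regressed plan: first require the bytes to be valid UTF-8
/// (standing in for CAST(foo AS Utf8)), then hex-encode them.
/// (Hypothetical helper, not a DataFusion API.)
fn hex_encode_via_utf8(bytes: &[u8]) -> Result<String, Utf8Error> {
    let s = std::str::from_utf8(bytes)?;
    Ok(hex_encode(s.as_bytes()))
}

fn main() {
    // The value inserted in the reproduction: X'8f50d3f6...0165'.
    let value: [u8; 20] = [
        0x8f, 0x50, 0xd3, 0xf6, 0x0e, 0xae, 0x37, 0x0d, 0xdb, 0xf8,
        0x5c, 0x86, 0x21, 0x9c, 0x55, 0x10, 0x8a, 0x35, 0x01, 0x65,
    ];

    // Encoding the raw bytes directly always succeeds.
    println!("direct:   {}", hex_encode(&value));

    // Going through UTF-8 fails, because 0x8f cannot start a UTF-8 sequence.
    match hex_encode_via_utf8(&value) {
        Ok(hex) => println!("via utf8: {}", hex),
        Err(e) => println!("via utf8: error at byte {}", e.valid_up_to()),
    }
}
```

The direct path matches the expected behavior above; the UTF-8 path rejects the value at its first byte, consistent with the "from index 0" part of the Arrow error message.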

cc @mesejo


Labels: bug, help wanted, regression
