Querying Parquet file specifically with a predicate returns invalid data error but works in other situations #14281

@senyosimpson

Describe the bug

When I run a query with a predicate against Parquet files generated with parquet-go, DataFusion fails with an error saying the data is invalid. The same query without a predicate works fine.

When using the CLI, I get the error:

» datafusion-cli --command "select * from 'go-parquet-writer/go-testfile.parquet' where age > 10"
DataFusion CLI v44.0.0
Error: External error: Parquet error: External: bad data

In my application, it is more descriptive, showing:

ParquetError(External(ProtocolError { kind: InvalidData, message: "cannot convert 2 into TType" }))

However, the file appears to be intact. The metadata is read and interpreted successfully:

» datafusion-cli --command "describe 'go-parquet-writer/go-testfile.parquet'"
DataFusion CLI v44.0.0
+---------------+-------------------------------------+-------------+
| column_name   | data_type                           | is_nullable |
+---------------+-------------------------------------+-------------+
| city          | Utf8View                            | NO          |
| country       | Utf8View                            | NO          |
| age           | UInt8                               | NO          |
| scale         | Int16                               | NO          |
| status        | UInt32                              | NO          |
| time_captured | Timestamp(Millisecond, Some("UTC")) | NO          |
| checked       | Boolean                             | NO          |
+---------------+-------------------------------------+-------------+
7 row(s) fetched.
Elapsed 0.001 seconds.

When I run the query without a predicate, I get the data back:

» datafusion-cli --command "select * from 'go-parquet-writer/go-testfile.parquet'"
DataFusion CLI v44.0.0
+--------+---------+-----+-------+--------+--------------------------+---------+
| city   | country | age | scale | status | time_captured            | checked |
+--------+---------+-----+-------+--------+--------------------------+---------+
| Madrid | Spain   | 10  | -1    | 12     | 2025-01-24T16:34:00.715Z | false   |
| Athens | Greece  | 32  | 1     | 20     | 2025-01-24T17:34:00.715Z | true    |
+--------+---------+-----+-------+--------+--------------------------+---------+
2 row(s) fetched.
Elapsed 0.002 seconds.

It even works if I use ORDER BY or GROUP BY:

» datafusion-cli --command "select * from 'go-parquet-writer/go-testfile.parquet' ORDER BY age DESC"
DataFusion CLI v44.0.0
+--------+---------+-----+-------+--------+--------------------------+---------+
| city   | country | age | scale | status | time_captured            | checked |
+--------+---------+-----+-------+--------+--------------------------+---------+
| Athens | Greece  | 32  | 1     | 20     | 2025-01-24T17:34:00.715Z | true    |
| Madrid | Spain   | 10  | -1    | 12     | 2025-01-24T16:34:00.715Z | false   |
+--------+---------+-----+-------+--------+--------------------------+---------+
2 row(s) fetched.
Elapsed 0.010 seconds.

» datafusion-cli --command "select city, SUM(age) AS age from 'go-parquet-writer/go-testfile.parquet' GROUP BY city"
DataFusion CLI v44.0.0
+--------+-----+
| city   | age |
+--------+-----+
| Athens | 32  |
| Madrid | 10  |
+--------+-----+
2 row(s) fetched.
Elapsed 0.004 seconds.

Additionally, this works when I use PyArrow and Pandas to load the Parquet file and filter it.

To Reproduce

The issue can be reproduced by creating a Parquet file with the parquet-go library and querying it with a predicate. To make this easier, I created a public repo containing the code that generates the file; its README has examples similar to the ones in this report. A test file can be found in go-parquet-writer/go-testfile.parquet, generated by the Go program in that directory.

I've also gone through the effort of trying to achieve the same using PyArrow and Pandas (which you'll see in the repo under pyarrow-ex) to verify the Parquet file is not corrupted in some way. This works as expected.

Expected behavior

The Parquet files created by parquet-go can successfully be queried when the query contains a predicate.

Additional context

From everything I've gathered, this error is likely coming from this conversion function. However, it only skips checking 0x02 when a collection is being parsed. Weirdly, I don't have any list/map/set in my schema. I assume this means this 0x02 is being used to encode something else but it is beyond my knowledge.
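
To make the 0x02 question more concrete, here is a small sketch of how type IDs are laid out in the Thrift compact protocol, as far as I understand the specification. In a short-form field header the low nibble is the type and the high nibble is the field-id delta, and type 0x02 is a boolean field whose value is false. The `strict_type` function is a hypothetical model of the behaviour described above (1 and 2 accepted only inside a collection), not the actual Rust code, and it collapses the remaining IDs for brevity.

```python
# Type IDs as listed in the Thrift compact protocol specification.
COMPACT_TYPES = {
    0x00: "STOP",
    0x01: "BOOLEAN_TRUE",
    0x02: "BOOLEAN_FALSE",
    0x03: "I8",
    0x04: "I16",
    0x05: "I32",
    0x06: "I64",
    0x07: "DOUBLE",
    0x08: "BINARY",
    0x09: "LIST",
    0x0A: "SET",
    0x0B: "MAP",
    0x0C: "STRUCT",
}


def split_field_header(byte):
    """Split a short-form field header into (field-id delta, type name)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("field header must be a single byte")
    delta = byte >> 4       # high nibble: field-id delta
    type_id = byte & 0x0F   # low nibble: compact type ID
    if type_id not in COMPACT_TYPES:
        raise ValueError(f"unknown compact type id {type_id}")
    return delta, COMPACT_TYPES[type_id]


def strict_type(type_id, in_collection):
    """Hypothetical model of a converter that accepts boolean IDs 1 and 2
    only as collection element types and rejects them everywhere else."""
    if type_id in (0x01, 0x02):
        if in_collection:
            return "BOOL"
        raise ValueError(f"cannot convert {type_id} into TType")
    if type_id not in COMPACT_TYPES:
        raise ValueError(f"cannot convert {type_id} into TType")
    return COMPACT_TYPES[type_id]
```

Under this model, a header byte such as 0x12 splits into delta 1 and BOOLEAN_FALSE, so a plain boolean struct field set to false would carry type 2 even when the schema has no list, map, or set. Whether that is what happens in this file is an assumption I have not verified.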

I went spelunking in the parquet-go codebase. The Thrift protocol implementation is split among the compact protocol, the Thrift type definitions, and the encoding logic.

Labels: bug, help wanted
