Skip to content

Regression: Wrong result when there are 2 count(distinct) #9586

@NGA-TRAN

Description

@NGA-TRAN

Describe the bug

Wrong result when there are 2 count distinct in the select clause,

To Reproduce

I will share there file with @alamb because I cannot attached .parquet file here

Bug

-- BUG
SELECT  COUNT(DISTINCT host) AS servers_count, count(distinct pool) as pool_count, server_role, os, env, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY server_role, os, env, datacenter;
+---------------+------------+-------------+---------+------------+------------+
| servers_count | pool_count | server_role | os      | env        | datacenter |
+---------------+------------+-------------+---------+------------+------------+
| 1             | 1          | mesg        | windows | production | mn         |
| 2             | 2          | mesg        | windows | production | va         |  -- should have 4 servers_count and 3 pool_count
+---------------+------------+-------------+---------+------------+------------+

Here are other queries running in DF CLI that tell me the right results

SELECT DISTINCT host, server_role, os, env, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' order by datacenter;
+---------------+-------------+---------+------------+------------+
| host          | server_role | os      | env        | datacenter |
+---------------+-------------+---------+------------+------------+
| mpm-mesg1002c | mesg        | windows | production | mn         |
| vpm-mesg1001c | mesg        | windows | production | va         |
| vpm-mesg1005b | mesg        | windows | production | va         |
| vpm-mesg1008d | mesg        | windows | production | va         |
| vpm-mesg1007b | mesg        | windows | production | va         |
+---------------+-------------+---------+------------+------------+

SELECT DISTINCT pool, server_role, os, env, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' order by datacenter;
+------+-------------+---------+------------+------------+
| pool | server_role | os      | env        | datacenter |
+------+-------------+---------+------------+------------+
| c    | mesg        | windows | production | mn         |
| b    | mesg        | windows | production | va         |
| c    | mesg        | windows | production | va         |
| d    | mesg        | windows | production | va         |
+------+-------------+---------+------------+------------+

SELECT  COUNT(DISTINCT host) AS servers_count, "server_role", "os", "env", "datacenter" from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY "server_role", "os", "env", "datacenter";
+---------------+-------------+---------+------------+------------+
| servers_count | server_role | os      | env        | datacenter |
+---------------+-------------+---------+------------+------------+
| 4             | mesg        | windows | production | va         |
| 1             | mesg        | windows | production | mn         |
+---------------+-------------+---------+------------+------------+

SELECT  count(distinct pool) as pool_count, "server_role", "os", "env", "datacenter" from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY "server_role", "os", "env", "datacenter";
+------------+-------------+---------+------------+------------+
| pool_count | server_role | os      | env        | datacenter |
+------------+-------------+---------+------------+------------+
| 1          | mesg        | windows | production | mn         |
| 3          | mesg        | windows | production | va         |
+------------+-------------+---------+------------+------------+

Expected behavior

SELECT  COUNT(DISTINCT host) AS servers_count, count(distinct pool) as pool_count, server_role, os, env, datacenter from '/tmp/file.parquet' WHERE time >= '2024-02-25T00:00:00Z' and time < '2024-02-25T00:00:01Z' and server_role = 'mesg' GROUP BY server_role, os, env, datacenter;
+---------------+------------+-------------+---------+------------+------------+
| servers_count | pool_count | server_role | os      | env        | datacenter |
+---------------+------------+-------------+---------+------------+------------+
| 1             | 1          | mesg        | windows | production | mn         |
| 4             | 3          | mesg        | windows | production | va         |  
+---------------+------------+-------------+---------+------------+------------+

### Additional context

_No response_

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingregressionSomething that used to work no longer does

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions