Conversation

@armitage420 (Contributor) commented Sep 15, 2025

What changes were proposed in this pull request?

Added an explicit ORDER BY to the queries.

Why are the changes needed?

The test itself uses SORT_QUERY_RESULTS to keep the query output ordering deterministic. But there is still room for non-determinism: SORT_QUERY_RESULTS sorts each query's output lexicographically on the unmasked rows, so if the to-be-masked values change, the output ordering changes as well. Hence we need to add an explicit ORDER BY to the queries.
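A minimal sketch of the failure mode, using made-up file names and sizes (the real rows come from Iceberg metadata queries): the rows are sorted before masking, so two runs that differ only in the masked-away values can emit the masked rows in different orders.

```python
# Hypothetical sketch of sort-then-mask non-determinism.
# Each row is "path<TAB>size"; the path gets masked, the size does not.
def mask(row):
    path, size = row.split("\t")
    return "#Masked#\t" + size

# Two runs of the same test: identical sizes, different (masked-away) paths.
run1 = ["a_file1.orc\t378", "a_file2.orc\t365"]
run2 = ["b_file2.orc\t365", "b_file9.orc\t378"]

# SORT_QUERY_RESULTS behavior: sort on the unmasked rows, then mask.
out1 = [mask(r) for r in sorted(run1)]  # the row with size 378 sorts first
out2 = [mask(r) for r in sorted(run2)]  # the row with size 365 sorts first
print(out1 == out2)  # False: the masked outputs disagree on row order
```

An explicit ORDER BY on an unmasked column pins the order independently of the masked values, which is what this PR does.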

Does this PR introduce any user-facing change?

No

How was this patch tested?

Verified the q.out file results and ran the test pipeline.

github-actions bot commented Sep 15, 2025

@check-spelling-bot Report

🔴 Please review

See the files view or the action log for details.

Unrecognized words (3)

bucketedtables
languagemanual
teradatabinaryserde

Previously acknowledged words that are now absent (6)

aarry
bytecode
cwiki
HIVEFETCHOUTPUTSERDE
timestamplocal
yyyy
To accept these unrecognized words as correct (and remove the previously acknowledged and now absent words), run the following commands

... in a clone of the [email protected]:armitage420/hive.git repository
on the flakyTest branch:

update_files() {
perl -e '
my @expect_files=qw('".github/actions/spelling/expect.txt"');
@ARGV=@expect_files;
my @stale=qw('"$patch_remove"');
my $re=join "|", @stale;
my $suffix=".".time();
my $previous="";
sub maybe_unlink { unlink($_[0]) if $_[0]; }
while (<>) {
if ($ARGV ne $old_argv) { maybe_unlink($previous); $previous="$ARGV$suffix"; rename($ARGV, $previous); open(ARGV_OUT, ">$ARGV"); select(ARGV_OUT); $old_argv = $ARGV; }
next if /^(?:$re)(?:(?:\r|\n)*$| .*)/; print;
}; maybe_unlink($previous);'
perl -e '
my $new_expect_file=".github/actions/spelling/expect.txt";
use File::Path qw(make_path);
use File::Basename qw(dirname);
make_path (dirname($new_expect_file));
open FILE, q{<}, $new_expect_file; chomp(my @words = <FILE>); close FILE;
my @add=qw('"$patch_add"');
my %items; @items{@words} = @words x (1); @items{@add} = @add x (1);
@words = sort {lc($a)."-".$a cmp lc($b)."-".$b} keys %items;
open FILE, q{>}, $new_expect_file; for my $word (@words) { print FILE "$word\n" if $word =~ /\w/; };
close FILE;
system("git", "add", $new_expect_file);
'
}

comment_json=$(mktemp)
curl -L -s -S \
-H "Content-Type: application/json" \
"https://api.github.com/repos/apache/hive/issues/comments/3291850494" > "$comment_json"
comment_body=$(mktemp)
jq -r ".body // empty" "$comment_json" > $comment_body
rm $comment_json

patch_remove=$(perl -ne 'next unless s{^</summary>(.*)</details>$}{$1}; print' < "$comment_body")

patch_add=$(perl -e '$/=undef; $_=<>; if (m{Unrecognized words[^<]*</summary>\n*```\n*([^<]*)```\n*</details>$}m) { print "$1" } elsif (m{Unrecognized words[^<]*\n\n((?:\w.*\n)+)\n}m) { print "$1" };' < "$comment_body")

update_files
rm $comment_body
git add -u
If the flagged items do not appear to be text

If items relate to a ...

  • well-formed pattern.

    If you can write a pattern that would match it,
    try adding it to the patterns.txt file.

    Patterns are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your lines.

    Note that patterns can't match multiline strings.

  • binary file.

    Please add a file path to the excludes.txt file matching the containing file.

    File paths are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your files.

    ^ refers to the file's path from the root of the repository, so ^README\.md$ would exclude README.md (on whichever branch you're using).

@armitage420 changed the title from "[WIP] Flaky test" to "HIVE-29201: Fix flaky test query_iceberg_metadata_of_unpartitioned_table.q" on Sep 15, 2025
github-actions bot commented Sep 16, 2025

@check-spelling-bot Report

🔴 Please review

Same three unrecognized words and the same remediation script as the Sep 15 report above; only the comment ID in the curl URL differs (3294726465).

@deniskuzZ (Member)

I would rather not select columns that are going to be masked

@armitage420 (Contributor, Author) commented Sep 16, 2025

@deniskuzZ Not selecting the masked columns is not feasible for this particular test: only parts of the column values (related to the metadata itself) are masked, not whole columns.

@deniskuzZ (Member) commented Sep 17, 2025

@deniskuzZ Not selecting the masked columns is not feasible for this particular test: only parts of the column values (related to the metadata itself) are masked, not whole columns.

oh, ok.

@armitage420

if the present masked values change

why would they change?

@armitage420 (Contributor, Author)

@armitage420

if the present masked values change

why would they change?

Thank you for your time @deniskuzZ !

Table total-size properties might change with a file format upgrade; in our case the format is ORC. Here's the JIRA for reference: HIVE-25607

Following the above-mentioned JIRA, another JIRA introduced masking for the same reason in the Iceberg q-files: HIVE-25658

@thomasrebele (Contributor) commented Sep 18, 2025

I had a look at this flaky test, too. If you look at the expected query result, the first columns are the same, but the first differing column is not sorted lexicographically:

0	hdfs://### HDFS PATH ###	ORC	0	#Masked#	378	...
0	hdfs://### HDFS PATH ###	ORC	0	#Masked#	365	...
0	hdfs://### HDFS PATH ###	ORC	0	#Masked#	374	...

The problem is that the output is sorted on the original values of ### HDFS PATH ### and #Masked#, and the values are replaced only after sorting.

Changing the query to make this deterministic is a workaround for this particular q file. A proposal for a more general fix: refactor the masking so that it is done before the sorting (the out stream is an org.apache.hadoop.hive.common.io.SortPrintStream).
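As a sketch of the proposed general fix, using the same made-up rows as above (hypothetical data, not the actual q-file output): masking each row before the lexicographic sort removes the dependence on the original, unstable values.

```python
# Hypothetical sketch of the mask-then-sort fix.
def mask(row):
    path, size = row.split("\t")
    return "#Masked#\t" + size

run1 = ["a_file1.orc\t378", "a_file2.orc\t365"]
run2 = ["b_file2.orc\t365", "b_file9.orc\t378"]

# Mask first, then sort: the sort key no longer contains unstable values.
out1 = sorted(mask(r) for r in run1)
out2 = sorted(mask(r) for r in run2)
print(out1 == out2)  # True: order is deterministic regardless of the paths
```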

@armitage420 (Contributor, Author) commented Sep 18, 2025

@thomasrebele Thank you for your input!
You are correct: the lexicographical sorting is done on unmasked values, so a better (and more accurate) fix would be to apply masking before sorting the results.
Currently, sorting is performed for every single query, whereas masking is only applied at the end, once all query results for the entire qfile have been collected. To implement the actual fix, we would need to change the test architecture so that masking is done per query, followed by sorting.
I'm not sure if this approach would be agreed upon, but if suggested, I can implement it!

@deniskuzZ @thomasrebele Do let me know what both of you think!
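A minimal sketch of what per-query mask-then-sort could look like (the class and method names here are hypothetical, not Hive's actual FetchConverter/SortPrintStream API): buffer each query's lines, apply the masking patterns as the lines arrive, then emit the sorted result when the query ends.

```python
import re

class MaskThenSortStream:
    """Hypothetical per-query stream: mask lines as they arrive,
    then flush them in sorted order when the query ends."""

    def __init__(self, out, patterns):
        self.out = out                                 # underlying line sink
        self.patterns = [(re.compile(p), r) for p, r in patterns]
        self.buffer = []

    def write_line(self, line):
        for pat, repl in self.patterns:                # mask BEFORE buffering
            line = pat.sub(repl, line)
        self.buffer.append(line)

    def end_query(self):
        for line in sorted(self.buffer):               # sort the masked lines
            self.out.append(line)
        self.buffer.clear()

# Usage: HDFS paths are masked before sorting, so output order only
# depends on the visible (unmasked) columns.
sink = []
s = MaskThenSortStream(sink, [(r"hdfs://\S+", "hdfs://### HDFS PATH ###")])
s.write_line("hdfs://nn:8020/warehouse/f2.orc\t365")
s.write_line("hdfs://nn:8020/warehouse/f1.orc\t378")
s.end_query()
```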

@thomasrebele (Contributor)

I've been working on a draft of applying the masking before the sorting (in addition to applying the masking at the end of the processing) in https://github.com/thomasrebele/hive/tree/tr/HIVE-29201-v1. The design of FetchConverter makes it difficult to implement this cleanly. Alternatively, we could make FetchConverter an interface (and the old class would become FetchConverterImpl) to simplify the logic of LambdaFetchConverter. What do you think, @armitage420, @deniskuzZ?

@deniskuzZ (Member) commented Sep 19, 2025

I think masking in this specific test isn't very effective, as it bypasses validation for several Iceberg metadata fields.
Doesn't query_iceberg_metadata_of_partitioned_table.q suffer from the same issue?

@armitage420 (Contributor, Author)

I think masking in this specific test isn't very effective, as it bypasses validation for several Iceberg metadata fields. Doesn't query_iceberg_metadata_of_partitioned_table.q suffer from the same issue?

Masking is only applied to HDFS paths, file_size_in_bytes, and the total file size in the table properties; it doesn't really affect the validation of the test.

@deniskuzZ (Member) commented Sep 19, 2025

Masking is only applied to HDFS paths, file_size_in_bytes, and the total file size in the table properties; it doesn't really affect the validation of the test.

@armitage420 the test adds some additional masking as well; try removing it and see for yourself. Why do we mask the row count instead of size_in_bytes?
https://iceberg.apache.org/docs/1.9.0/spark-queries/#all-data-files

0	hdfs://### HDFS PATH ###	ORC	0	5	        378	{1:7,2:30}	

0	hdfs://### HDFS PATH ###	ORC	0	#Masked#	378	{1:7,2:30}
