Skip to content

Conversation

@mgaido91
Copy link
Contributor

@mgaido91 mgaido91 commented Jun 14, 2018

What changes were proposed in this pull request?

The PR proposes to add support for running the same SQL test input files against different configs leading to the same result.

How was this patch tested?

Involved UTs

@mgaido91
Copy link
Contributor Author

cc @cloud-fan

@SparkQA
Copy link

SparkQA commented Jun 14, 2018

Test build #91859 has finished for PR 21568 at commit ed01ff0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -0,0 +1,193 @@
-- Automatically generated by SQLQueryTestSuite
Copy link
Member

@maropu maropu Jun 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new feature looks good to me. Btw, if we set multiple configurations, filenames get too long? How about adding the configuration info. in the head of files? Then, filenames are;

./decimalArithmeticOperations.sql.out.1
./decimalArithmeticOperations.sql.out.2
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have two cases:
1 - Different configs produce different results (so different files) and in this case your suggestion is fine;
2 - Different configs produce the same results (so we have one golden file for all of them), how would you address this case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the second case also output multiple same result files for each config.

// these files have the same result
subquery/in-subquery/in-joins.sql.out.1 <- sort-merge joins
subquery/in-subquery/in-joins.sql.out.2 <- hash joins

"typeCoercion/native/decimalArithmeticOperations.sql" ->
(Seq(Seq(SQLConf.DECIMAL_OPERATIONS_ALLOW_PREC_LOSS.key -> "true"),
Seq(SQLConf.DECIMAL_OPERATIONS_ALLOW_PREC_LOSS.key -> "false")) -> false),
"subquery/in-subquery/in-joins.sql" -> configsAllJoinTypes,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entry and the others blow don't change existing test result output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, changing these configs should not change the output.

@rxin
Copy link
Contributor

rxin commented Jun 16, 2018

I'm confused by the description. What does this PR actually do?

@maropu
Copy link
Member

maropu commented Jun 16, 2018

IIUC this pr modified code to run a single test file (in SQLQueryTestSuite) multiple times with different configurations.

@mgaido91
Copy link
Contributor Author

@rxin the PR does what was suggested in these 2 comments #20023 (comment) and #21529 (comment). Basically we want to run the same SQL test file with different configs.

We have two cases:

  • Running with different configs produces different output (so different golden files);
  • Running with different configs produces the same output (so we have only one golden file) but the tests are run against different configs.

The goals are to avoid to copy and paste the same queries after setting different configurations (as it was done in decimalArithmeticOperations) and to be able to improve test coverage for the joins (because with default configs we basically always execute broadcast joins).

@maropu
Copy link
Member

maropu commented Jun 16, 2018

We need to strictly handle the second case, too? If we accepted the same output files for that case, we could have simpler output file name rules as I described here?

@mgaido91
Copy link
Contributor Author

Sorry, @maropu, I didn't get want you mean by

If we accepted the same output files for that case

may you please explain me?

Anyway, the problem with that proposal is not really about the filenames, but it is about adding the configs inside the files as comments. Because if we have the same output file we cannot include them in the header...

// We should ignore this file from processing.
)

private val configsAllJoinTypes = Seq(Seq(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment describing what each config set means for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I am not sure what you mean exactly. Every config has it is own description which explains what it is intended for. Did you mean something different?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I meant that here you definitely want to test against different join types, maybe it is good if you can describe which join type each config set means.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now, got it, thanks. I am not sure as we do not usually do so in other test suites. @cloud-fan @maropu what do you think?

private def testCases(inputPath: String): Seq[TestCase] = {
val baseResultFileName = inputPath.replace(inputFilePath, goldenFilePath)
val testCaseName = inputPath.stripPrefix(inputFilePath).stripPrefix(File.separator)
testConfigs.get(testCaseName) match {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once a test is renamed, it might silently turn to default test config. To avoid that, maybe we should explicitly define which test case needs to run against custom configs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmmh... Well, actually it is not very silently, as you would see it executes only once it the test runs and not many times with different suffixes. But of course it won't fail. @cloud-fan @maropu what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need to take care of the issue, how about listing up all the test files inside SQLQueryTestSuite.scala? Then, if developers add/delete/rename test files, they need to update the list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I do not really like this idea @maropu, it puts an extra effort which is not really needed...

@maropu
Copy link
Member

maropu commented Jun 18, 2018

In the same result case, I'm worried that we cannot easily understand which SQL configs cause failures?
IMHO withSQLConf has the same issue, too;

Seq(true, false).foreach { flag =>
  withSQLConf("spark.sql.anyFlag" -> flag.toString) {
    ...
  }
}

In this test case (common patterns?), we cannot understand which case (true or false) causes the failure at first glance. For example, can we use withClue to solve this?
master...maropu:AddConfigInfoInException

@mgaido91
Copy link
Contributor Author

@maropu I think we can, instead. Since the configs are put as suffix in the test case name (this happens also in the same result case) you know which configs failed.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Jul 9, 2018

kindly ping @cloud-fan

@rxin
Copy link
Contributor

rxin commented Jul 9, 2018

I think it's super confusing to have the config names encoded in file names. Makes the names super long and difficult to read, and also hard to verify what was set, and difficult to get multiple configs.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Jul 9, 2018

@rxin then we can do what @maropu suggested, i.e. adding a numeric suffix and maybe logging the used configs so even though we.don't have them in the test name we can anyway know which of them is failing. What do you think? Do you have a better proposal?

@rxin
Copy link
Contributor

rxin commented Jul 9, 2018

Can you just define a config matrix in the beginning of the file, and each file is run with the config matrix?

@mgaido91
Copy link
Contributor Author

mgaido91 commented Jul 9, 2018

Sorry,what do you mean by a config matrix? And how would we discriminate whether each config should produce the same result or they should produce different ones?

@rxin
Copy link
Contributor

rxin commented Jul 9, 2018

If they produce different results why do you need any infrastructure for them? They are just part of the normal test flow.

If they produce the same result, and you don't want to define the same test queries twice, we can create an infra for that. I thought that's what this is about?

@mgaido91
Copy link
Contributor Author

mgaido91 commented Jul 9, 2018

The goal here was to address both cases. The need came out in previous PRs in order to avoid to copy and paste the same queries in order to test them with different configs. So the idea here was to have an infra which covers both cases in order to avoid this copy-and paste (see decimalOperations.sql for instance).

@rxin
Copy link
Contributor

rxin commented Jul 9, 2018

What are the use cases other than decimal? I am not sure if we need to build a lot of infrastructure just for one or two use cases.

@mgaido91
Copy link
Contributor Author

mgaido91 commented Jul 9, 2018

The main other case here was to improve also the coverage for the join operations, because with the default values we are testing only the broadcast join.

@cloud-fan
Copy link
Contributor

It's good to cover both of the cases in one design, but I'd like to prioritize the join one.

I feel it's common to try with different optimization/runtime configs and make sure we get corrected result. It's more important than the decimal one that just saves some typing.

Seems it's hard to reach a consensus of a good design to cover both of the cases, how about we just do the join one? i.e. a SQL test file can specify a config matrix(we need to design a syntax for it), and the test framework should run this test file with specified configs and their values to make sure the results all match the golden file.

@mgaido91
Copy link
Contributor Author

I am not sure it is a great idea to do only one of the 2 scenarios, if we plan to later include both them as we might have to redo the same work twice. But if you all agree on this plan I'll stick to it.

@cloud-fan
Copy link
Contributor

We can deal with the decimal test file specially if that's the only use case. For now I'd say the join test is more important and let's finish it first.

@rxin
Copy link
Contributor

rxin commented Jul 10, 2018 via email

@mgaido91
Copy link
Contributor Author

@cloud-fan @rxin I updated the PR in order to handle only the case in which we have the same result for different configs and the configs are specified in the SQL files.
In order to make clear which config caused a job error I added a ERROR message in the logs.

Do you all agree with this approach?

@SparkQA
Copy link

SparkQA commented Jul 11, 2018

Test build #92857 has finished for PR 21568 at commit 6f90e63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

runQueries(queries, testCase.resultFile, None)
} else {
configSets.foreach { configSet =>
try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to do the try-catch inside runQueries, so that we can know which config cause a failure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmh, but we know which are the configs causing a failure also here (and we are logging them). Am I missing something from your comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah sorry I misread the code.

@cloud-fan
Copy link
Contributor

LGTM

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 592cc84 Jul 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants