[SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf #32243
WDYT? If no one disagrees with this, I'll file a JIRA for it. @juliuszsompolski @npoggi @wangyum @HyukjinKwon @yaooqinn
HyukjinKwon left a comment
Yup, I like this idea. cc @gatorsmile too FYI
.github/workflows/build_and_test.yml
The part above was copied from https://github.com/apache/spark/pull/31303/files#diff-48c0ee97c53013d18d6bbae44648f7fab9af2e0bf5b0dc1ca761e18ec5c478f2 cc: @wangyum
This PR updated the golden files because the datagen used to generate maropu/spark-tpcds-sf-1 was older than the databricks/tpcds-kit one.
In this case, it seems that we had better split this PR into two. WDYT, @maropu ?
yea, sgtm. We can split it by checking out my repo instead: https://github.com/maropu/spark-tpcds-datagen/tree/master/thirdparty/tpcds-kit/tools
I'll do it later, too.
okay, I've fixed the code so that it generates the same TPC-DS data as https://github.com/maropu/spark-tpcds-sf-1.
I rewrote the option-parsing logic because Spark does not have a dependency on scopt. cc: @wangyum
https://github.com/databricks/spark-sql-perf/blob/ca4ccea3dd824597601e001d3fefe205e2bf50b0/src/main/scala/com/databricks/spark/sql/perf/tpcds/GenTPCDSData.scala#L45
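Since the review mentions dropping scopt, here is a minimal sketch of what a hand-rolled argument parser could look like in plain Scala. The names (`DatagenConfig`, `parseArgs`) and the option set are illustrative, not the actual code in this PR.

```scala
// Illustrative config; the real PR's fields may differ.
case class DatagenConfig(
    master: String = "local[*]",
    dsdgenDir: String = "",
    location: String = "",
    scaleFactor: Int = 1)

object ArgParser {
  // Recursively consume "--flag value" pairs without any external library.
  @annotation.tailrec
  def parseArgs(args: List[String], config: DatagenConfig): DatagenConfig = args match {
    case "--master" :: value :: tail =>
      parseArgs(tail, config.copy(master = value))
    case "--dsdgenDir" :: value :: tail =>
      parseArgs(tail, config.copy(dsdgenDir = value))
    case "--location" :: value :: tail =>
      parseArgs(tail, config.copy(location = value))
    case "--scaleFactor" :: value :: tail =>
      parseArgs(tail, config.copy(scaleFactor = value.toInt))
    case Nil => config
    case unknown :: _ =>
      throw new IllegalArgumentException(s"Unknown option: $unknown")
  }
}
```

Pattern matching over the argument list keeps the parser small and fails fast on unrecognized flags, which is usually enough for a test-only entry point.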
Does the empty result set look right?
Yea, I need to check the answers query-by-query later...
Does the NULL look right?
ditto
ditto
ditto
Great work, @maropu, and thanks for making this happen!
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #137658 has finished for PR 32243 at commit
.github/workflows/build_and_test.yml
Is this sufficient?
yes, I think it is for now, since this PR targets schema definition changes.
Glad to see this update happening. Ideally, if we can store the [compressed] sf1 scale-factor data on a public server, there shouldn't be much of a need for compiling
I think it's okay. As you pointed out, GitHub Actions already caches the generated SF1 data, and it doesn't seem to add much code for that.
@maropu would you mind porting the TPC-DS spec and query update to
My bad. I misunderstood it and just used different seeds.
Of course not. I'll probably have time to work on it tomorrow.
Test build #138095 has finished for PR 32243 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
```scala
val commands = Seq(
  "bash", "-c",
  s"cd $localToolsDir && ./dsdgen -table $tableName -filter Y -scale $scaleFactor " +
  s"-RNGSEED 19620718 $parallel")
```
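For context, a command sequence like the one above is typically run through `scala.sys.process`. This is a hedged sketch with placeholder values (`/tmp/tpcds-kit/tools` and the table name are assumptions, not the PR's actual paths), and the actual execution line is commented out since it requires a built `dsdgen` binary.

```scala
import scala.sys.process._

// Placeholder values for illustration only; the real code derives these
// from the parsed options.
val localToolsDir = "/tmp/tpcds-kit/tools"
val tableName = "call_center"
val scaleFactor = 1
val parallel = ""  // e.g. "-parallel 10 -child 1" when generating in chunks

// Same shape as the snippet above: run dsdgen through bash so the
// `cd` and the relative ./dsdgen invocation work together.
val commands = Seq(
  "bash", "-c",
  s"cd $localToolsDir && ./dsdgen -table $tableName -filter Y -scale $scaleFactor " +
    s"-RNGSEED 19620718 $parallel")

// Executing it would look like this (requires the dsdgen binary to exist):
// val output = commands.!!
```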
NOTE: In a follow-up PR, I'll revert this value to 100 to use https://github.com/databricks/tpcds-kit instead. See: #32243 (comment)
Kubernetes integration test unable to build dist. exiting with code: 1
```scala
 * build/sbt "sql/test:runMain <this class> --dsdgenDir <path> --location <path> --scaleFactor 1"
 * }}}
 */
object GenTPCDSData {
```
Shall we rename the file `tpcdsDatagen.scala` -> `GenTPCDSData.scala`?
sure
```scala
|build/sbt "test:runMain <this class> [Options]"
|Options:
|  --master the Spark master to use, default to local[*]
|  --dsdgenDir location of dsdgen
```
Is this the only difference from your repo?
yea, yes.
```scala
  clusterByPartitionColumns = config.clusterByPartitionColumns,
  filterOutNullPartitionValues = config.filterOutNullPartitionValues,
  tableFilter = config.tableFilter,
  numPartitions = config.numPartitions)
```
We need `spark.stop()` after this line.
Oh, I forgot it. Thank you.
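The fix being discussed can be sketched generically with a try/finally guard, so the session is stopped even if generation fails. `Stoppable` stands in for `SparkSession` here and is purely illustrative, not the PR's actual code.

```scala
// Stand-in for SparkSession, for illustration only.
trait Stoppable { def stop(): Unit }

// Ensure stop() is called once data generation finishes, even on error.
def runDatagen(spark: Stoppable)(genData: => Unit): Unit = {
  try {
    genData
  } finally {
    spark.stop()  // the call the review pointed out was missing
  }
}
```

Wrapping the call site this way avoids leaking the session when `genData` throws partway through.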
Thank you for the update, @maropu. The updated version looks neat!
Test build #138110 has finished for PR 32243 at commit
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #138117 has finished for PR 32243 at commit
This reverts commit a88fe32.
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #138119 has finished for PR 32243 at commit
okay, ready to review. cc: @juliuszsompolski @npoggi @wangyum
Thank you, @HyukjinKwon @dongjoon-hyun ~ Merged to master.
Arriving a bit late. Looks good. We should move the DDL for the tables into resource files at some point. Thanks for the update.
… Adds a new job in GitHub Actions to check the output of TPC-DS queries

### What changes were proposed in this pull request?
This PR proposes to add a new job in GitHub Actions to check the output of TPC-DS queries. NOTE: To generate TPC-DS table data in GA jobs, this PR includes generator code implemented in #32243 and #32460. This is the backport PR of #31886.

### Why are the changes needed?
There are some cases where we noticed runtime-related bugs after merging commits (e.g. SPARK-33822). Therefore, I think it is worth adding a new job in GitHub Actions to check the query output of TPC-DS (sf=1).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The new test added.

Closes #32462 from maropu/TPCDSQUeryTestSuite-Branch3.1.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… Adds a new job in GitHub Actions to check the output of TPC-DS queries

### What changes were proposed in this pull request?
This PR proposes to add a new job in GitHub Actions to check the output of TPC-DS queries. NOTE: To generate TPC-DS table data in GA jobs, this PR includes generator code implemented in #32243 and #32460. This is the backport PR of #31886.

### Why are the changes needed?
There are some cases where we noticed runtime-related bugs after merging commits (e.g. SPARK-33822). Therefore, I think it is worth adding a new job in GitHub Actions to check the query output of TPC-DS (sf=1).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The new test added.

Closes #32479 from maropu/TPCDSQueryTestSuite-Branch3.0.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This PR proposes to port minimal code to generate TPC-DS data from databricks/spark-sql-perf. The classes in a new file, `tpcdsDatagen.scala`, are basically copied from the `databricks/spark-sql-perf` codebase. Note that I've modified them a bit to follow the Spark code style and removed unnecessary parts from them.
The code authors of these classes are:
@juliuszsompolski
@npoggi
@wangyum
Why are the changes needed?
We frequently use TPC-DS data now for benchmarks/tests, but the classes for the TPC-DS schemas of datagen and benchmarks/tests are managed separately, e.g.,
I think this causes some inconveniences, e.g., we need to update both files in separate repositories if we update the TPC-DS schema (#32037). So, it would be useful for the Spark codebase to generate them by referring to the same schema definition.
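As an illustration of this "single schema definition" idea, here is a hypothetical sketch where one table description feeds both a DDL string for tests and a column list for the generator. The `TpcdsTable` class and the two sample columns are assumptions for illustration, not the PR's actual classes.

```scala
// Hypothetical shared schema definition: datagen and tests both read
// from the same table description, so a schema change happens once.
case class TpcdsTable(name: String, fields: Seq[(String, String)]) {
  // DDL for the benchmark/test side.
  def ddl: String =
    fields.map { case (n, t) => s"$n $t" }
      .mkString(s"CREATE TABLE $name (", ", ", ")")

  // Column names for the datagen side.
  def columnNames: Seq[String] = fields.map(_._1)
}

val callCenter = TpcdsTable("call_center", Seq(
  "cc_call_center_sk" -> "INT",
  "cc_call_center_id" -> "STRING"))
```

With this shape, updating a column type in one place keeps the generated data and the test DDL in sync automatically.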
Does this PR introduce any user-facing change?
dev only.
How was this patch tested?
Manually checked and GA passed.