[SPARK-21368][SQL] TPCDSQueryBenchmark can't refer query files. #18592
Conversation
Test build #79473 has finished for PR 18592 at commit

Test build #79474 has finished for PR 18592 at commit

Retest this please.

Test build #79483 has finished for PR 18592 at commit
-      val queryString = fileToString(new File(Thread.currentThread().getContextClassLoader
-        .getResource(s"tpcds/$name.sql").getFile))
+      val queryString = resourceToString(s"tpcds/$name.sql", "UTF-8",
+        Thread.currentThread().getContextClassLoader)
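For context, once the benchmark is packaged into a jar, the URL returned by getResource points inside the jar, so getFile yields a path that java.io.File cannot open; reading the resource as a stream works in both cases. A minimal sketch of the stream-based approach (the helper name here is hypothetical, not Spark's actual resourceToString):

```scala
import java.io.InputStream
import scala.io.Source

// Hypothetical helper illustrating the stream-based approach:
// URL.getFile on a resource inside a jar yields a path like
// ".../spark-sql-tests.jar!/tpcds/q1.sql", which java.io.File cannot
// open, whereas getResourceAsStream works for both directories and jars.
def readResource(path: String, encoding: String = "UTF-8"): String = {
  val in: InputStream =
    Thread.currentThread().getContextClassLoader.getResourceAsStream(path)
  require(in != null, s"resource not found on classpath: $path")
  try Source.fromInputStream(in, encoding).mkString
  finally in.close()
}
```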
Do we need to pass the encoding explicitly? I feel it'd be better like this:
val queryString = resourceToString(s"tpcds/$name.sql",
classLoader = Thread.currentThread().getContextClassLoader)
Fixed.
        "please modify the value of dataLocation to point to your local TPCDS data")
      val tableSizes = setupTables(dataLocation)
      queries.foreach { name =>
        val queryString = fileToString(new File(Thread.currentThread().getContextClassLoader
Please drop import java.io.File.
Dropped.
  def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = {
    require(dataLocation.nonEmpty,
      "please modify the value of dataLocation to point to your local TPCDS data")
It seems we don't need this check.
Yes, this is no longer needed.
    if (args.length < 1) {
      // scalastyle:off println
      println(
        "Usage: spark-submit --class <this class> --jars <spark sql test jar> <data location>")
How about also printing the description, like this:
if (args.length < 1) {
// scalastyle:off println
println(
s"""
|Usage: spark-submit --class <this class> <spark sql test jar> <TPCDS data location>
|
|In order to run this benchmark, please follow the instructions at
|https://github.com/databricks/spark-sql-perf/blob/master/README.md to generate the TPCDS data
|locally (preferably with a scale factor of 5 for benchmarking). Thereafter, the value of
|dataLocation below needs to be set to the location where the generated data is stored.
""".stripMargin)
// scalastyle:on println
System.exit(1)
}
Now printing the description as you suggested.
I manually checked this and I think the correct usage is the one above; it seems we don't need the --jars option there.
Test build #79514 has finished for PR 18592 at commit
Yeah.

Thanks!

Test build #79515 has finished for PR 18592 at commit
cc: @gatorsmile
        |Usage: spark-submit --class <this class> <spark sql test jar> <TPCDS data location>
        |
        |In order to run this benchmark, please follow the instructions at
        |https://github.com/databricks/spark-sql-perf/blob/master/README.md
To be honest, I took a look at this page, and the instructions are not easy to understand. Maybe we also need to improve that part.
I played around a bit with the generator part in spark-sql-perf, and that part seems small relative to the whole package. So, is there any plan to create a new repository under the databricks org just for generating the data, or something like that?
Thanks for working on it! Will try it this weekend.
yea, thanks!
@gatorsmile Could we merge this first? I feel we could discuss the rest on JIRA.

ping @gatorsmile

gentle ping @gatorsmile

ping @gatorsmile

Will review it today.
  }

  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
Could we also allow another way to run this benchmark?
We can hardcode the value of dataLocation and run it in IntelliJ directly.
@sarutak kindly ping
We can pass the argument through the run configuration even when we use an IDE like IntelliJ, right?
Or, how about providing dataLocation through a new property?
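As a sketch of that idea (the property name is an assumption, not an actual Spark configuration), the data location could come from the first CLI argument with a system-property fallback, so the benchmark can also be launched from an IDE without editing the code:

```scala
// Hypothetical sketch: resolve the TPCDS data location from the first
// command-line argument, falling back to a system property (name
// illustrative) so the benchmark can run from IntelliJ directly.
def resolveDataLocation(args: Array[String]): Option[String] =
  args.headOption.orElse(sys.props.get("spark.sql.tpcds.dataLocation"))
```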
@sarutak @maropu Could we do something like https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala?
Later, we can also add another argument for outputting the plans of the TPC-DS queries, instead of running the actual queries.
Good idea. I'll add TPCDSQueryBenchmarkArguments.
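A minimal sketch of what such an argument class could look like, in the style of ApplicationMasterArguments (the flag name and structure are assumptions; the class eventually added to Spark may differ):

```scala
// Hypothetical sketch of an argument-parser class modeled on
// ApplicationMasterArguments; flag names are illustrative only.
class TPCDSQueryBenchmarkArguments(args: Array[String]) {
  var dataLocation: Option[String] = None

  // Walk the argument list recursively, consuming flag/value pairs.
  private def parse(remaining: List[String]): Unit = remaining match {
    case "--data-location" :: value :: tail =>
      dataLocation = Some(value)
      parse(tail)
    case Nil => // done
    case unknown :: _ =>
      throw new IllegalArgumentException(s"Unknown argument: $unknown")
  }

  parse(args.toList)
}
```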
BTW, I've often used this benchmark class, and I think it would be useful to be able to filter which TPC-DS queries to run via configuration.
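As a sketch of that filtering idea (names are illustrative only, not the interface that was actually proposed later):

```scala
// Hypothetical sketch: keep only the queries named in a comma-separated
// filter string, e.g. "q4,q17"; no filter means run everything.
def filterQueries(all: Seq[String], filter: Option[String]): Seq[String] =
  filter.map(_.split(",").map(_.trim).toSet) match {
    case Some(wanted) => all.filter(wanted.contains)
    case None => all
  }
```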
Also sounds good to me.

Thanks, I'll make a pr later.

@gatorsmile I opened a new pr, so if you get time, could you check #19188? Thanks!

Test build #81679 has finished for PR 18592 at commit

Test build #81682 has finished for PR 18592 at commit

LGTM

Thanks! Merged to master.

LGTM
What changes were proposed in this pull request?
TPCDSQueryBenchmark packaged into a jar doesn't work with spark-submit, because references to the query files inside the jar fail to resolve.
How was this patch tested?
Ran the benchmark.