[SPARK-21368][SQL] TPCDSQueryBenchmark can't refer query files. #18592
Conversation
Test build #79473 has finished for PR 18592 at commit

Test build #79474 has finished for PR 18592 at commit

Retest this please.

Test build #79483 has finished for PR 18592 at commit
-      val queryString = fileToString(new File(Thread.currentThread().getContextClassLoader
-        .getResource(s"tpcds/$name.sql").getFile))
+      val queryString = resourceToString(s"tpcds/$name.sql", "UTF-8",
+        Thread.currentThread().getContextClassLoader)
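For context, once the benchmark is packaged into a jar, the URL returned by getResource points inside the jar, so getFile yields a path that java.io.File cannot open; reading the resource as a stream works in both cases. A minimal sketch of the stream-based approach (the helper name here is hypothetical, not Spark's actual resourceToString):

```scala
import java.io.InputStream
import scala.io.Source

// Hypothetical helper illustrating the stream-based approach:
// URL.getFile on a resource inside a jar yields a path like
// ".../spark-sql-tests.jar!/tpcds/q1.sql", which java.io.File cannot
// open, whereas getResourceAsStream works for both directories and jars.
def readResource(path: String, encoding: String = "UTF-8"): String = {
  val in: InputStream =
    Thread.currentThread().getContextClassLoader.getResourceAsStream(path)
  require(in != null, s"resource not found on classpath: $path")
  try Source.fromInputStream(in, encoding).mkString
  finally in.close()
}
```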
Do we need to pass the encoding explicitly? I feel it'd be better like this:
val queryString = resourceToString(s"tpcds/$name.sql",
classLoader = Thread.currentThread().getContextClassLoader)
Fixed.
        "please modify the value of dataLocation to point to your local TPCDS data")
      val tableSizes = setupTables(dataLocation)
      queries.foreach { name =>
        val queryString = fileToString(new File(Thread.currentThread().getContextClassLoader
Please drop import java.io.File.
Dropped.
  def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = {
    require(dataLocation.nonEmpty,
      "please modify the value of dataLocation to point to your local TPCDS data")
It seems we don't need this check.
Yes, this is no longer needed.
    if (args.length < 1) {
      // scalastyle:off println
      println(
        "Usage: spark-submit --class <this class> --jars <spark sql test jar> <data location>")
How about also printing the description, like this:
if (args.length < 1) {
// scalastyle:off println
println(
s"""
|Usage: spark-submit --class <this class> <spark sql test jar> <TPCDS data location>
|
|In order to run this benchmark, please follow the instructions at
|https://github.com/databricks/spark-sql-perf/blob/master/README.md to generate the TPCDS data
|locally (preferably with a scale factor of 5 for benchmarking). Thereafter, the value of
|dataLocation below needs to be set to the location where the generated data is stored.
""".stripMargin)
// scalastyle:on println
System.exit(1)
}
Now printing the description as you suggested.
I manually checked this and I think the correct usage is the one above; it seems we don't need the --jars option there.
Test build #79514 has finished for PR 18592 at commit
Yeah.

Thanks!

Test build #79515 has finished for PR 18592 at commit
cc: @gatorsmile
        |Usage: spark-submit --class <this class> <spark sql test jar> <TPCDS data location>
        |
        |In order to run this benchmark, please follow the instructions at
        |https://github.com/databricks/spark-sql-perf/blob/master/README.md
To be honest, I took a look at this page, and the instructions are not easy to understand. Maybe we also need to improve that part.
I played around a bit with the generator part in spark-sql-perf, and that part seems small relative to the whole package. So, is there any plan to create a new repository under the databricks org just for generating the data, or something like that?
Thanks for working on it! Will try it this weekend.
yea, thanks!
@gatorsmile Could we merge this first? I feel we could discuss the rest on JIRA.

ping @gatorsmile

gentle ping @gatorsmile

ping @gatorsmile

Will review it today.
  }

  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
Could we also allow another way to run this benchmark?
We can hardcode the value of dataLocation and run it in IntelliJ directly.
@sarutak kindly ping
We can pass the argument through the run configuration even when we use an IDE like IntelliJ, right?
Or, how about providing dataLocation through a new property?
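As a sketch of that idea (the property name is an assumption, not an actual Spark configuration), the data location could come from the first CLI argument with a system-property fallback, so the benchmark can also be launched from an IDE without editing the code:

```scala
// Hypothetical sketch: resolve the TPCDS data location from the first
// command-line argument, falling back to a system property (name
// illustrative) so the benchmark can run from IntelliJ directly.
def resolveDataLocation(args: Array[String]): Option[String] =
  args.headOption.orElse(sys.props.get("spark.sql.tpcds.dataLocation"))
```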
@sarutak @maropu Could we do something like https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala?
Later, we can also add another argument for outputting the plans of the TPC-DS queries, instead of running the actual queries.
Good idea. I'll add TPCDSQueryBenchmarkArguments.
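A minimal sketch of what such an argument class could look like, in the style of ApplicationMasterArguments (the flag name and structure are assumptions; the class eventually added to Spark may differ):

```scala
// Hypothetical sketch of an argument-parser class modeled on
// ApplicationMasterArguments; flag names are illustrative only.
class TPCDSQueryBenchmarkArguments(args: Array[String]) {
  var dataLocation: Option[String] = None

  // Walk the argument list recursively, consuming flag/value pairs.
  private def parse(remaining: List[String]): Unit = remaining match {
    case "--data-location" :: value :: tail =>
      dataLocation = Some(value)
      parse(tail)
    case Nil => // done
    case unknown :: _ =>
      throw new IllegalArgumentException(s"Unknown argument: $unknown")
  }

  parse(args.toList)
}
```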
BTW, I've often used this benchmark class, and I think it would be useful to be able to filter which TPC-DS queries to run via configuration.
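As a sketch of that filtering idea (names are illustrative only, not the interface that was actually proposed later):

```scala
// Hypothetical sketch: keep only the queries named in a comma-separated
// filter string, e.g. "q4,q17"; no filter means run everything.
def filterQueries(all: Seq[String], filter: Option[String]): Seq[String] =
  filter.map(_.split(",").map(_.trim).toSet) match {
    case Some(wanted) => all.filter(wanted.contains)
    case None => all
  }
```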
Also sounds good to me.

Thanks, I'll make a pr later.

@gatorsmile I opened a new pr, so if you get time, could you check #19188? Thanks!

Test build #81679 has finished for PR 18592 at commit

Test build #81682 has finished for PR 18592 at commit

LGTM

Thanks! Merged to master.

LGTM
What changes were proposed in this pull request?
TPCDSQueryBenchmark packaged into a jar doesn't work with spark-submit, because references to the query files inside the jar fail to resolve.
How was this patch tested?
Ran the benchmark.