Conversation

@sarutak (Member) commented Jul 10, 2017

What changes were proposed in this pull request?

TPCDSQueryBenchmark doesn't work with spark-submit when it is packaged into a jar.
This is because the query files bundled in the jar cannot be referenced as plain files on the filesystem.
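
A minimal sketch of the failure mode (illustrative, not the PR's code): for a resource packaged in a jar, getResource returns a jar: URL whose getFile portion is not a filesystem path, so a java.io.File built from it cannot be read. Reading the resource as a stream works in both layouts:

    import scala.io.Source

    val loader = Thread.currentThread().getContextClassLoader

    // Fails for a jar entry: the "file" part of a jar: URL looks like
    // ".../spark-sql-tests.jar!/tpcds/q1.sql", which java.io.File cannot open.
    // val broken = Source.fromFile(new java.io.File(
    //   loader.getResource("tpcds/q1.sql").getFile))

    // Works for both plain files and jar entries:
    val in = loader.getResourceAsStream("tpcds/q1.sql")
    val queryString =
      try Source.fromInputStream(in, "UTF-8").mkString finally in.close()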

How was this patch tested?

Ran the benchmark.

@SparkQA commented Jul 10, 2017

Test build #79473 has finished for PR 18592 at commit 6da7419.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 10, 2017

Test build #79474 has finished for PR 18592 at commit e5669e4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member):

Retest this please.

@SparkQA commented Jul 11, 2017

Test build #79483 has finished for PR 18592 at commit e5669e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Before:

    val queryString = fileToString(new File(Thread.currentThread().getContextClassLoader
      .getResource(s"tpcds/$name.sql").getFile))

After:

    val queryString = resourceToString(s"tpcds/$name.sql", "UTF-8",
      Thread.currentThread().getContextClassLoader)
Member:

Do we need the encoding explicitly? I feel it'd be better like:

val queryString = resourceToString(s"tpcds/$name.sql",
  classLoader = Thread.currentThread().getContextClassLoader)

Member Author:

Fixed.
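
Why dropping the explicit "UTF-8" compiles: a sketch assuming resourceToString declares default parameter values (hypothetical, not Spark's actual source):

    import scala.io.Source

    // Assumed shape of the helper, with defaults (illustrative only):
    def resourceToString(
        resource: String,
        encoding: String = "UTF-8",
        classLoader: ClassLoader = Thread.currentThread().getContextClassLoader): String = {
      val in = classLoader.getResourceAsStream(resource)
      try Source.fromInputStream(in, encoding).mkString finally in.close()
    }

    // With the defaults above, the call site only needs the named argument:
    // resourceToString(s"tpcds/$name.sql", classLoader = myLoader)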

"please modify the value of dataLocation to point to your local TPCDS data")
val tableSizes = setupTables(dataLocation)
queries.foreach { name =>
val queryString = fileToString(new File(Thread.currentThread().getContextClassLoader
Member:

Please drop import java.io.File.

Member Author:

Dropped.


  def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = {
    require(dataLocation.nonEmpty,
      "please modify the value of dataLocation to point to your local TPCDS data")
Member:

It seems we don't need this check.

Member Author:

Yes, this is no longer needed.

    if (args.length < 1) {
      // scalastyle:off println
      println(
        "Usage: spark-submit --class <this class> --jars <spark sql test jar> <data location>")
Member:

How about also printing the description below, like:

    if (args.length < 1) {
      // scalastyle:off println
      println(
        s"""
           |Usage: spark-submit --class <this class> <spark sql test jar> <TPCDS data location>
           |
           |In order to run this benchmark, please follow the instructions at
           |https://github.com/databricks/spark-sql-perf/blob/master/README.md to generate the TPCDS data
           |locally (preferably with a scale factor of 5 for benchmarking). Thereafter, the value of
           |dataLocation below needs to be set to the location where the generated data is stored.
         """.stripMargin)
      // scalastyle:on println
      System.exit(1)
    }

Member Author:

Now printing it as you suggested.

@maropu (Member) commented Jul 11, 2017

I manually checked this, and I think the correct usage is:

spark-submit --class <this class> <test jar> <data>

It seems we don't need --jars?

@SparkQA commented Jul 11, 2017

Test build #79514 has finished for PR 18592 at commit 14b188a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak (Member Author) commented Jul 11, 2017

Yeah, spark-submit --class <this class> <test jar> <data> is correct.
The instructions at the head of TPCDSQueryBenchmark.scala were also wrong; I've fixed them.
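
A concrete invocation might then look like the following (the jar file name and data path are illustrative):

    spark-submit \
      --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark \
      spark-sql_2.11-tests.jar \
      /path/to/tpcds-data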

@maropu (Member) commented Jul 11, 2017

Thanks!

@SparkQA commented Jul 11, 2017

Test build #79515 has finished for PR 18592 at commit 2022c45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Jul 11, 2017

cc: @gatorsmile

|Usage: spark-submit --class <this class> <spark sql test jar> <TPCDS data location>
|
|In order to run this benchmark, please follow the instructions at
|https://github.com/databricks/spark-sql-perf/blob/master/README.md
Member:

To be honest, I took a look at this page, and the instructions are not easy to understand. Maybe we need to resolve that part too.

Member:

I played around a bit with the generator part of spark-sql-perf, and I think that part is small relative to the whole package. So, is there any plan to create a new repository in the databricks org just for generating the data, or something like that?

Member:

Thanks for working on it! Will try it this weekend.

Member:

yea, thanks!

@maropu (Member) commented Jul 24, 2017

@gatorsmile Could we merge this first? I feel we could discuss the rest on JIRA.

@kiszk (Member) commented Aug 2, 2017

ping @gatorsmile

@kiszk (Member) commented Aug 15, 2017

gentle ping @gatorsmile

@kiszk (Member) commented Sep 8, 2017

ping @gatorsmile

@gatorsmile (Member):

Will review it today.

  }

  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
Member:

Could we also allow another way to run this benchmark?

We can hardcode the value of dataLocation and run it in IntelliJ directly.

Member:

@sarutak kindly ping

Member Author:

We can pass the argument through the run configuration even when we use an IDE like IntelliJ, right?
Or, how about giving dataLocation through a new property?
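
A sketch of the property idea (the property name spark.sql.tpcds.dataLocation is hypothetical, purely for illustration):

    // Hypothetical fallback: read the data location from a JVM system
    // property when no command-line argument is supplied, so the benchmark
    // can also be launched from an IDE run configuration without arguments.
    def resolveDataLocation(args: Array[String]): String =
      args.headOption
        .orElse(sys.props.get("spark.sql.tpcds.dataLocation"))
        .getOrElse(sys.error(
          "Specify the TPCDS data location as an argument or via " +
            "-Dspark.sql.tpcds.dataLocation=<path>"))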

@gatorsmile (Member) commented Sep 11, 2017

@sarutak @maropu Could we do something like https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterArguments.scala?

Later, we can also add another argument for outputting the plans of the TPC-DS queries, instead of running the actual queries.

Member Author:

Good idea. I'll add TPCDSQueryBenchmarkArguments.
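
A minimal sketch of what such a parser could look like, modeled on ApplicationMasterArguments (the --data-location flag and the overall shape are assumptions, not the merged code):

    // Hypothetical argument parser, for illustration only.
    class TPCDSQueryBenchmarkArguments(val args: Array[String]) {
      var dataLocation: String = _

      parseArgs(args.toList)

      private def parseArgs(inputArgs: List[String]): Unit = {
        var remaining = inputArgs
        while (remaining.nonEmpty) {
          remaining match {
            case "--data-location" :: value :: tail =>
              dataLocation = value
              remaining = tail
            case unknown :: _ =>
              System.err.println(s"Unknown parameter: $unknown")
              System.exit(1)
            case Nil => // unreachable: the loop guard ensures non-empty
          }
        }
        if (dataLocation == null) {
          System.err.println("--data-location is required")
          System.exit(1)
        }
      }
    }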

@maropu (Member) commented Sep 10, 2017

BTW, I've often used this benchmark class, and I think it would be useful to filter which TPC-DS queries to run via a configuration like spark.sql.tpcds.queryFilter="q3,q18,q23", because running all the queries takes a long time. WDYT? @gatorsmile
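
A sketch of that filter (assuming a SparkSession named spark and the full queries sequence are in scope; the config key comes from the comment above):

    // Illustrative only: run just the queries named in the filter,
    // or all queries when the filter is unset or empty.
    val queryFilter: Set[String] = spark.conf
      .get("spark.sql.tpcds.queryFilter", "")
      .split(",").map(_.trim).filter(_.nonEmpty).toSet

    val queriesToRun: Seq[String] =
      if (queryFilter.isEmpty) queries else queries.filter(queryFilter.contains)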

@gatorsmile (Member):

Also sounds good to me.

@maropu (Member) commented Sep 11, 2017

Thanks, I'll make a PR later.

@maropu (Member) commented Sep 11, 2017

@gatorsmile I opened a new pr, so if you get time, could you check #19188? Thanks!

@SparkQA commented Sep 12, 2017

Test build #81679 has finished for PR 18592 at commit 06e306f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class TPCDSQueryBenchmarkArguments(val args: Array[String])

@SparkQA commented Sep 12, 2017

Test build #81682 has finished for PR 18592 at commit d2d22d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

LGTM

@gatorsmile (Member):

Thanks! Merged to master.

@jiangxb1987 (Contributor):

LGTM

@asfgit closed this in b9b54b1 on Sep 12, 2017