
Conversation

@wangyum (Contributor) commented Jan 24, 2021

How to use it:

build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d /root/tmp/tpcds-kit/tools -s 5 -l /root/tmp/tpcds5g -f parquet"
[root@spark-3267648 spark-sql-perf]# build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData --help"
[info] Running com.databricks.spark.sql.perf.tpcds.GenTPCDSData --help
[info] Usage: Gen-TPC-DS-data [options]
[info]
[info]   -m, --master <value>     the Spark master to use, default to local[*]
[info]   -d, --dsdgenDir <value>  location of dsdgen
[info]   -s, --scaleFactor <value>
[info]                            scaleFactor defines the size of the dataset to generate (in GB)
[info]   -l, --location <value>   root directory of location to create data in
[info]   -f, --format <value>     valid spark format, Parquet, ORC ...
[info]   -i, --useDoubleForDecimal <value>
[info]                            true to replace DecimalType with DoubleType
[info]   -e, --useStringForDate <value>
[info]                            true to replace DateType with StringType
[info]   -o, --overwrite <value>  overwrite the data that is already there
[info]   -p, --partitionTables <value>
[info]                            create the partitioned fact tables
[info]   -c, --clusterByPartitionColumns <value>
[info]                            shuffle to get partitions coalesced into single files
[info]   -v, --filterOutNullPartitionValues <value>
[info]                            true to filter out the partition with NULL key value
[info]   -t, --tableFilter <value>
[info]                            "" means generate all tables
[info]   -n, --numPartitions <value>
[info]                            how many dsdgen partitions to run - number of input tasks.
[info]   --help                   prints this usage text
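Putting the options above together, a fuller invocation might look like the sketch below. The class name, flags, and format come from the help text above; the paths, scale factor, and option values are illustrative assumptions, not taken from the PR.

```shell
# Sketch: generate a 1 GB TPC-DS dataset as partitioned Parquet.
# Paths are placeholders -- point -d at your tpcds-kit/tools build
# and -l at the desired output directory.
build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData \
  -d /opt/tpcds-kit/tools \
  -s 1 \
  -l /data/tpcds-sf1 \
  -f parquet \
  -p true \
  -n 8"
```

Per the help text, `-p true` creates partitioned fact tables and `-n` controls how many dsdgen partitions (input tasks) run in parallel.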

@npoggi (Contributor) left a comment


@wangyum, could you update the README.md with the instructions and an example?
Also, please change the example from scale factor 5 (5 GB) to 1, 3, or 10, as 5 is not covered by the TPC-DS spec. Thanks for the contribution.

@wangyum wangyum closed this Mar 13, 2021
@HyukjinKwon HyukjinKwon reopened this Mar 29, 2021
@HyukjinKwon (Member)

@wangyum, is it ready for a review?

@wangyum wangyum closed this Mar 29, 2021
@wangyum wangyum reopened this Mar 29, 2021
@wangyum (Contributor, Author) commented Mar 29, 2021

Yes.

@HyukjinKwon (Member) left a comment


This looks good, especially given that it will likely be used regularly in Apache Spark's CI (apache/spark#31886).

@HyukjinKwon (Member)

I will leave it to @npoggi for the final look.

@npoggi (Contributor) left a comment


LGTM. Go ahead with the merge, thanks!

@HyukjinKwon HyukjinKwon merged commit ca4ccea into databricks:master Mar 30, 2021