
Conversation

@wangyum (Contributor) commented Jan 24, 2021

How to use it:

build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d /root/tmp/tpcds-kit/tools -s 5 -l /root/tmp/tpcds5g -f parquet"
[root@spark-3267648 spark-sql-perf]# build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData --help"
[info] Running com.databricks.spark.sql.perf.tpcds.GenTPCDSData --help
[info] Usage: Gen-TPC-DS-data [options]
[info]
[info]   -m, --master <value>     the Spark master to use, default to local[*]
[info]   -d, --dsdgenDir <value>  location of dsdgen
[info]   -s, --scaleFactor <value>
[info]                            scaleFactor defines the size of the dataset to generate (in GB)
[info]   -l, --location <value>   root directory of location to create data in
[info]   -f, --format <value>     valid spark format, Parquet, ORC ...
[info]   -i, --useDoubleForDecimal <value>
[info]                            true to replace DecimalType with DoubleType
[info]   -e, --useStringForDate <value>
[info]                            true to replace DateType with StringType
[info]   -o, --overwrite <value>  overwrite the data that is already there
[info]   -p, --partitionTables <value>
[info]                            create the partitioned fact tables
[info]   -c, --clusterByPartitionColumns <value>
[info]                            shuffle to get partitions coalesced into single files
[info]   -v, --filterOutNullPartitionValues <value>
[info]                            true to filter out the partition with NULL key value
[info]   -t, --tableFilter <value>
[info]                            "" means generate all tables
[info]   -n, --numPartitions <value>
[info]                            how many dsdgen partitions to run - number of input tasks.
[info]   --help                   prints this usage text
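Putting the options above together, a fuller invocation might look like the sketch below. The class name, flags, and format come from the help text above; the paths, scale factor, and option values are illustrative assumptions, not taken from the PR.

```shell
# Sketch: generate a 1 GB TPC-DS dataset as partitioned Parquet.
# Paths are placeholders -- point -d at your tpcds-kit/tools build
# and -l at the desired output directory.
build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData \
  -d /opt/tpcds-kit/tools \
  -s 1 \
  -l /data/tpcds-sf1 \
  -f parquet \
  -p true \
  -n 8"
```

Per the help text, `-p true` creates partitioned fact tables and `-n` controls how many dsdgen partitions (input tasks) run in parallel.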

@npoggi (Contributor) left a comment


@wangyum, could you update the README.md with the instructions and an example?
Also, please change the example from scale factor 5 (5 GB) to 1, 3, or 10, as 5 is not covered by the TPC-DS spec. Thanks for the contribution.

@wangyum wangyum closed this Mar 13, 2021
@HyukjinKwon HyukjinKwon reopened this Mar 29, 2021
@HyukjinKwon (Member)

@wangyum, is it ready for a review?

@wangyum wangyum closed this Mar 29, 2021
@wangyum wangyum reopened this Mar 29, 2021
@wangyum (Contributor, Author) commented Mar 29, 2021

Yes.

@HyukjinKwon (Member) left a comment


This looks good, especially given that it will likely be used regularly in Apache Spark's CI (apache/spark#31886).

@HyukjinKwon (Member)

I will leave it to @npoggi for the final look.

@npoggi (Contributor) left a comment


LGTM. Go ahead with the merge, thanks!

@HyukjinKwon HyukjinKwon merged commit ca4ccea into databricks:master Mar 30, 2021