Commit cd689c9

[SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf
### What changes were proposed in this pull request?

This PR proposes to port minimal code to generate TPC-DS data from [databricks/spark-sql-perf](https://github.com/databricks/spark-sql-perf). The classes in a new class file `tpcdsDatagen.scala` are basically copied from the `databricks/spark-sql-perf` codebase. Note that I've modified them a bit to follow the Spark code style and removed unnecessary parts from them.

The code authors of these classes are:
- juliuszsompolski
- npoggi
- wangyum

### Why are the changes needed?

We frequently use TPCDS data now for benchmarks/tests, but the classes for the TPCDS schemas of datagen and benchmarks/tests are managed separately, e.g.,
- https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
- https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala

I think this causes some inconveniences, e.g., we need to update both files in the separate repositories if we update the TPCDS schema #32037. So, it would be useful for the Spark codebase to generate them by referring to the same schema definition.

### Does this PR introduce _any_ user-facing change?

dev only.

### How was this patch tested?

Manually checked and GA passed.

Closes #32243 from maropu/tpcdsDatagen.

Authored-by: Takeshi Yamamuro <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
1 parent caa46ce commit cd689c9

4 files changed: +1019 -549 lines changed

.github/workflows/build_and_test.yml

Lines changed: 18 additions & 13 deletions
@@ -493,19 +493,6 @@ jobs:
     steps:
     - name: Checkout Spark repository
       uses: actions/checkout@v2
-    - name: Cache TPC-DS generated data
-      id: cache-tpcds-sf-1
-      uses: actions/cache@v2
-      with:
-        path: ./tpcds-sf-1
-        key: tpcds-556111e35d400f56cb0625dc16e9063d54628320
-    - name: Checkout TPC-DS (SF=1) generated data repository
-      if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
-      uses: actions/checkout@v2
-      with:
-        repository: maropu/spark-tpcds-sf-1
-        ref: 556111e35d400f56cb0625dc16e9063d54628320
-        path: ./tpcds-sf-1
     - name: Cache Scala, SBT and Maven
       uses: actions/cache@v2
       with:
@@ -528,6 +515,24 @@ jobs:
       uses: actions/setup-java@v1
       with:
         java-version: 8
+    - name: Cache TPC-DS generated data
+      id: cache-tpcds-sf-1
+      uses: actions/cache@v2
+      with:
+        path: ./tpcds-sf-1
+        key: tpcds-${{ hashFiles('sql/core/src/test/scala/org/apache/spark/sql/TPCDSSchema.scala') }}
+    - name: Checkout tpcds-kit repository
+      if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
+      uses: actions/checkout@v2
+      with:
+        repository: maropu/spark-tpcds-datagen
+        path: ./tpcds-kit
+    - name: Build tpcds-kit
+      if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
+      run: cd tpcds-kit/thirdparty/tpcds-kit/tools && make OS=LINUX
+    - name: Generate TPC-DS (SF=1) table data
+      if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
+      run: build/sbt "sql/test:runMain org.apache.spark.sql.GenTPCDSData --dsdgenDir `pwd`/tpcds-kit/thirdparty/tpcds-kit/tools --location `pwd`/tpcds-sf-1 --scaleFactor 1 --numPartitions 1 --overwrite"
     - name: Run TPC-DS queries
       run: |
         SPARK_TPCDS_DATA=`pwd`/tpcds-sf-1 build/sbt "sql/testOnly org.apache.spark.sql.TPCDSQueryTestSuite"
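
The workflow change above replaces a cache key pinned to a commit of the pre-generated data repository with one derived from the contents of `TPCDSSchema.scala`, so the cached `./tpcds-sf-1` data is regenerated only when the schema file changes. The Python sketch below illustrates that idea under stated assumptions: GitHub documents `hashFiles` as a SHA-256 over the per-file SHA-256 digests, and this approximates that scheme without claiming to reproduce its exact output. The function names `hash_files`, `tpcds_cache_key`, and `needs_regeneration` are hypothetical and exist only for illustration; modeling the cache as one local directory per key is also an assumption, not how `actions/cache` stores data.

```python
import hashlib
from pathlib import Path


def hash_files(paths):
    """Approximate hashFiles(): SHA-256 each file, then SHA-256 the digests.

    Assumption: this mirrors the documented scheme but is not guaranteed to
    produce the same hex string as GitHub Actions.
    """
    combined = hashlib.sha256()
    for path in sorted(str(p) for p in paths):
        combined.update(hashlib.sha256(Path(path).read_bytes()).digest())
    return combined.hexdigest()


def tpcds_cache_key(schema_file):
    """Build a key shaped like the workflow's `tpcds-${{ hashFiles(...) }}`."""
    return f"tpcds-{hash_files([schema_file])}"


def needs_regeneration(cache_root, key):
    """Stand-in for `cache-hit != 'true'`: no directory for this key yet."""
    return not (Path(cache_root) / key).is_dir()
```

With helpers like these, an unchanged schema file yields the same key, so the build and generate steps are skipped. Any edit to the schema file yields a new key and triggers regeneration, which is what keeps the cached data in step with the schema definition.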
