Contributor

@sdd sdd commented Jul 28, 2024

This PR adds some performance testing capabilities. It includes the following features:

  • docker-compose environment that includes containers for Minio, Spark, HAProxy and the Iceberg REST Catalog
  • Uses HAProxy to simulate real-world latency and bandwidth constraints of connections to services like S3
  • Includes scripting to create an Iceberg table in the performance testing environment and populate it with data from the widely-used NYC Taxi dataset
  • Adds a justfile for ease of creating, initialising, starting, stopping and tearing down the performance testing environment
  • Adds some Criterion benchmarks that use the performance testing environment to test the performance of TableScan.plan_files in four different representative scenarios
  • Adds some Criterion benchmarks that use the performance testing environment to test the performance of TableScan.to_arrow in four different representative scenarios
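To illustrate why simulated latency and bandwidth limits matter for scan benchmarks, here is a minimal back-of-the-envelope sketch. The `LinkProfile` type and its methods are hypothetical and not part of this PR, and the model (fixed per-request latency plus a bandwidth cap, with requests issued one after another) is an assumption made for illustration, not a description of how HAProxy is configured here.

```rust
use std::time::Duration;

/// Hypothetical link profile: a fixed per-request latency plus a bandwidth cap.
/// Assumes `bytes_per_sec` is non-zero.
struct LinkProfile {
    latency: Duration,
    bytes_per_sec: u64,
}

impl LinkProfile {
    /// Estimated time for one request that transfers `bytes` bytes.
    fn request_time(&self, bytes: u64) -> Duration {
        // Integer arithmetic in u128 avoids overflow and float rounding.
        let transfer_nanos = (bytes as u128) * 1_000_000_000 / (self.bytes_per_sec as u128);
        self.latency + Duration::from_nanos(transfer_nanos as u64)
    }

    /// Estimated time for `count` sequential requests of `bytes` bytes each.
    fn sequential_time(&self, count: u32, bytes: u64) -> Duration {
        self.request_time(bytes) * count
    }
}
```

Under this model, with an illustrative 50 ms latency and 10,000,000 bytes/s of bandwidth, one 1,000,000-byte request takes 150 ms, while the same bytes split across 100 sequential 10,000-byte requests take 5.1 s. A benchmark run against an unthrottled local object store would hide this kind of latency-dominated cost.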

The performance tests can be set up and run with just perf-run. Before running the tests, this performs the following setup actions, checking each one first and skipping it if it was already completed on a previous run:

  • Download NYC taxi data parquets
  • Spin up docker containers
  • Create a table
  • Insert test data from the parquets
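The check-then-skip behaviour of these setup actions can be sketched as follows. This is a minimal illustration of the pattern only, not the justfile from this PR: `PerfEnv`, `Step`, `perf_steps` and `run_setup` are hypothetical names, and each `run` function just flips a flag where a real harness would download files or invoke docker.

```rust
/// Hypothetical record of what already exists in the perf environment.
#[derive(Debug, Default)]
struct PerfEnv {
    data_downloaded: bool,
    containers_up: bool,
    table_created: bool,
    data_inserted: bool,
}

/// One setup action paired with a check for whether it is already done.
struct Step {
    name: &'static str,
    is_done: fn(&PerfEnv) -> bool,
    run: fn(&mut PerfEnv) -> Result<(), String>,
}

/// The four setup actions from the list above, in order.
/// Each `run` is a placeholder that only records completion.
fn perf_steps() -> [Step; 4] {
    [
        Step {
            name: "download data",
            is_done: |e: &PerfEnv| e.data_downloaded,
            run: |e: &mut PerfEnv| { e.data_downloaded = true; Ok(()) },
        },
        Step {
            name: "start containers",
            is_done: |e: &PerfEnv| e.containers_up,
            run: |e: &mut PerfEnv| { e.containers_up = true; Ok(()) },
        },
        Step {
            name: "create table",
            is_done: |e: &PerfEnv| e.table_created,
            run: |e: &mut PerfEnv| { e.table_created = true; Ok(()) },
        },
        Step {
            name: "insert data",
            is_done: |e: &PerfEnv| e.data_inserted,
            run: |e: &mut PerfEnv| { e.data_inserted = true; Ok(()) },
        },
    ]
}

/// Runs the steps in order, skipping any whose check reports completion.
/// Returns the names of the steps that were actually executed.
fn run_setup(env: &mut PerfEnv, steps: &[Step]) -> Result<Vec<&'static str>, String> {
    let mut executed = Vec::new();
    for step in steps {
        if (step.is_done)(env) {
            continue; // already done on a previous run
        }
        (step.run)(env)?;
        executed.push(step.name);
    }
    Ok(executed)
}
```

Calling `run_setup` a second time on the same `PerfEnv` executes nothing, which is the property that makes repeated runs cheap.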

@sdd sdd mentioned this pull request Aug 2, 2024
@sdd sdd force-pushed the perf-suite branch 5 times, most recently from 6d0a7ee to 56f068e, August 9, 2024 23:25
@sdd sdd changed the title from "feat: performance testing harness and perf tests for scan file plan" to "feat: performance testing harness and perf tests for scan file plan and execution" Aug 9, 2024
@sdd sdd changed the title from "feat: performance testing harness and perf tests for scan file plan and execution" to "Table Scan Performance tests" Aug 10, 2024
@sdd sdd changed the title from "Table Scan Performance tests" to "Table Scan Performance Tests" Aug 10, 2024
@sdd sdd marked this pull request as ready for review August 13, 2024 19:18
Contributor Author

sdd commented Aug 13, 2024

@Xuanwo and @liurenjie1024: This is now passing and ready for review.

@sdd sdd force-pushed the perf-suite branch 2 times, most recently from f90d2d4 to a00b32a, August 15, 2024 20:35
Member

@Xuanwo Xuanwo left a comment

Thanks a lot for driving this work!

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @sdd for this PR. I just skimmed through it and I see what you are going for. I have some concerns with this approach: for example, I feel it would be difficult to maintain and to extend to other cases. I'm more interested in integrating with DataFusion for this kind of thing, such as integration tests and benchmarks. What do you think?
