[SPARK-16196][SQL] Codegen in-memory scan with ColumnarBatches #13899
Conversation
Note, this doesn't work: spark.table("tab1").collect(), because we're trying to cast ColumnarBatch.Row into UnsafeRow. This works, however: spark.table("tab1").groupBy("i").sum("j").collect().
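A minimal sketch of the two calls being contrasted (the table setup below is hypothetical, assuming a cached table tab1 with integer columns i and j):

```scala
// Hypothetical setup: a small table with two integer columns.
spark.range(10).selectExpr("id AS i", "id * 2 AS j").write.saveAsTable("tab1")
spark.table("tab1").cache()

// Fails at this point in the patch: the scan yields ColumnarBatch.Row,
// which cannot be cast to UnsafeRow when rows are collected directly.
spark.table("tab1").collect()

// Works: the aggregate consumes the columnar rows before collect().
spark.table("tab1").groupBy("i").sum("j").collect()
```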
Previously we could only support schemas where all columns are Longs because we hardcoded putLong and getLong calls in the write path. This led to unfathomable NPEs when trying to cache something with other types. This commit fixes that by generalizing the code that builds column batches.
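To illustrate the problem and the fix, here is a hand-written sketch in the spirit of the generated write path; the real generated output is in the gists linked at the bottom, and the helper names here are illustrative only:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.vectorized.ColumnarBatch

// Illustrative only: the old hardcoded-long writer, which breaks on any
// schema whose columns are not all longs.
def writeRowOld(batch: ColumnarBatch, rowId: Int, row: InternalRow): Unit = {
  batch.column(0).putLong(rowId, row.getLong(0)) // wrong unless col 0 is a long
  batch.column(1).putLong(rowId, row.getLong(1)) // wrong unless col 1 is a long
}

// Illustrative only: the generalized writer dispatches on each column's
// actual type, mirroring what the generated code now does for an
// (int, double) schema.
def writeRowNew(batch: ColumnarBatch, rowId: Int, row: InternalRow): Unit = {
  batch.column(0).putInt(rowId, row.getInt(0))       // IntegerType column
  batch.column(1).putDouble(rowId, row.getDouble(1)) // DoubleType column
}
```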
Test build #61203 has finished for PR 13899 at commit

Test build #61206 has finished for PR 13899 at commit
@andrewor14 Looks interesting. I created two PRs (#11956, #12894) that generate code similar to yours. My PRs use the current … I am waiting for a committer's review of my two PRs.
jenkins retest this please
Test build #64644 has finished for PR 13899 at commit
Test build #3237 has finished for PR 13899 at commit
Closing for now; too many conflicts.
What changes were proposed in this pull request?
This patch makes InMemoryRelation faster by generating code to store the input rows as ColumnarBatches. This code path is enabled by default but only supports primitive types, falling back to the old, slower code path if there are unsupported types (e.g. strings, arrays, UDTs) in the schema.

The old code path reads the input rows into ColumnBuilders, which is slow because these builders are backed by ByteBuffers and involve many virtual function calls, especially when compression is enabled.

The following numbers are derived from the read path (i.e. returning rows from cached batches in memory). The baseline is the first row. The second and third rows describe caching performance before this patch. The last row describes caching performance after this patch.
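As described above, the columnar path only covers primitive schemas. A minimal sketch of the kind of schema check this fallback implies (the helper name and exact type list are assumptions, not the patch's actual code):

```scala
import org.apache.spark.sql.types._

// Hypothetical predicate: use the codegen'd ColumnarBatch path only when
// every column is a primitive type; otherwise fall back to ColumnBuilders.
def supportsColumnarCache(schema: StructType): Boolean =
  schema.fields.forall { field =>
    field.dataType match {
      case BooleanType | ByteType | ShortType | IntegerType |
           LongType | FloatType | DoubleType => true
      case _ => false // e.g. strings, arrays, UDTs take the old path
    }
  }
```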
Future TODOs (outside the scope of this issue):

- Return ColumnarBatch.Rows instead of UnsafeRows so operators downstream can further benefit from the columnar representation

How was this patch tested?
CacheBenchmark, InMemoryColumnarQuerySuite, existing tests
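For context, a rough sketch of how one might time the read path the way CacheBenchmark does (this is not the benchmark's actual code; the data size and query are made up):

```scala
// Hypothetical micro-benchmark: materialize a cached table of primitives,
// then time a scan that reads rows back out of the cached batches.
val df = spark.range(1 << 24).selectExpr("id AS i", "id * 3 AS j")
df.cache()
df.count() // force the cache to be built (write path)

val start = System.nanoTime()
df.selectExpr("sum(i)", "sum(j)").collect() // read path over cached batches
println(s"Read path took ${(System.nanoTime() - start) / 1e6} ms")
```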
Generated code

Write path: https://gist.github.com/andrewor14/a9ed9d942029457a0f953e809ac26ee9
Read path: https://gist.github.com/andrewor14/7ce4c37a3c6bcd5cc2b6b16c861859e9