[SPARK-23819][SQL] Fix InMemoryTableScanExec complex type pruning #20935
Conversation
… to out of date ColumnStats
ok to test

cc @kiszk

Test build #88757 has finished for PR 20935 at commit
ordering.foreach { order =>
  val value = row.get(ordinal, dataType)
  if (upper == null || order.gt(value, upper)) upper = value
  if (lower == null || order.lt(value, lower)) lower = value
For unsafe row and array, don't we need to copy the value? The added test can't catch this because the random rows are all individual instances; during query evaluation, however, it can be the same reused instance of an unsafe row or array.
Yes, thanks for catching this.
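To illustrate the aliasing issue being discussed, here is a standalone sketch (the projection and values are made up for the example; only the buffer-reuse behavior is the point):

```scala
// UnsafeProjection reuses a single UnsafeRow buffer across calls, so retaining the
// returned reference as an upper/lower bound without copy() means the "bound"
// silently changes when the next row is projected.
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection}
import org.apache.spark.sql.types.{DataType, IntegerType}

val project = UnsafeProjection.create(Array[DataType](IntegerType))
val retained = project(new GenericInternalRow(Array[Any](1))) // alias of the reused buffer
val snapshot = retained.copy()                                // independent copy
project(new GenericInternalRow(Array[Any](2)))
// retained.getInt(0) now reads 2, while snapshot.getInt(0) still reads 1
```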
testColumnStats(classOf[DoubleColumnStats], DOUBLE, Array(Double.MaxValue, Double.MinValue, 0))
testColumnStats(classOf[StringColumnStats], STRING, Array(null, null, 0))
testDecimalColumnStats(Array(null, null, 0))
testColumnStats(classOf[BooleanColumnStats], BOOLEAN, Array(true, false, 0, 0, 0))
Those changes to testColumnStats seem unnecessary?
The column statistics have 5 fields in their array, so the zip comparison on the initial stats will drop the final two.
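To spell out the comparison issue (illustrative values; the field order follows the collectedStatistics array discussed in this PR):

```scala
// collectedStatistics is Array(lower, upper, nullCount, count, sizeInBytes).
// zip truncates to the shorter side, so a 3-element expectation silently skips
// the count and sizeInBytes fields.
val collected = Array[Any](null, null, 0, 7, 128)
val expected = Array[Any](null, null, 0)
assert(expected.zip(collected).length == 3) // the final two fields are never compared
```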
  unsafeRow.getMap(0).copy
}
toUnsafeMap(ArrayBasedMapData(
  Map(Random.nextInt() -> UTF8String.fromString(Random.nextString(Random.nextInt(32))))))
Seems the above changes to data generation are unnecessary too?
The ColumnTypes for Map/Struct/Array all end up casting to their unsafe counterparts to get the size for the statistics, so the test data needs to reflect that as well.
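Roughly what that conversion looks like in the test (reconstructed from the diff above; the exact helper signature is an assumption):

```scala
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeMapData, UnsafeProjection}
import org.apache.spark.sql.catalyst.util.MapData
import org.apache.spark.sql.types.{DataType, MapType}

// Project the safe map through an UnsafeProjection so the stats see the same
// UnsafeMapData that the ColumnType sizing path casts to.
def toUnsafeMap(map: MapData, mapType: MapType): UnsafeMapData = {
  val projection = UnsafeProjection.create(Array[DataType](mapType))
  val unsafeRow = projection(new GenericInternalRow(Array[Any](map)))
  unsafeRow.getMap(0).copy()
}
```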
private[columnar] final class ObjectColumnStats(dataType: DataType) extends ColumnStats {
  val columnType = ColumnType(dataType)
private abstract class OrderableSafeColumnStats[T](dataType: DataType) extends ColumnStats {
OrderableObjectColumnStats?
private val ordering = dataType match {
  case x if RowOrdering.isOrderable(dataType) =>
    Option(TypeUtils.getInterpretedOrdering(x))
  case _ => None
Since this class is only for "orderable" types, maybe we don't need an Option here and ordering can just be Ordering[T].
This is for DataTypes that could be orderable, since Arrays and Structs may have child data types that aren't.
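Concretely (an illustrative check, not part of the patch): a struct or array is orderable only when every nested type is, and MapType is not orderable today, so the Option can legitimately be None.

```scala
import org.apache.spark.sql.catalyst.expressions.RowOrdering
import org.apache.spark.sql.types._

RowOrdering.isOrderable(new StructType().add("i", IntegerType))                      // true
RowOrdering.isOrderable(new StructType().add("m", MapType(IntegerType, StringType))) // false -> ordering is None
RowOrdering.isOrderable(ArrayType(MapType(IntegerType, StringType)))                 // false -> ordering is None
```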
  gatherValueStats(value)
} else {
  gatherNullStats
  gatherNullStats()
I don't think the change to gatherNullStats is necessary...
Yeah, this was mostly from the Scala style guide, since the method mutates the backing stats: http://docs.scala-lang.org/style/method-invocation.html#arity-0
I don't have a strong opinion though, so happy to swap it back.
Let's just swap it back to make the diff small.
  }
}
def testStructColumnStats(
Can't we merge testArrayColumnStats, testMapColumnStats and testStructColumnStats?
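One possible shape for a merged helper (names and assertions here are illustrative, not the PR's code): parameterize on the DataType and a ColumnStats factory instead of keeping three near-identical methods.

```scala
def testComplexColumnStats(dataType: DataType, makeStats: DataType => ColumnStats): Unit = {
  test(s"${dataType.typeName}: empty") {
    val stats = makeStats(dataType)
    // initial stats: lower, upper, nullCount, count, sizeInBytes
    assert(stats.collectedStatistics.toSeq === Seq[Any](null, null, 0, 0, 0))
  }
}

// testComplexColumnStats(ArrayType(IntegerType), new ArrayColumnStats(_))
// testComplexColumnStats(new StructType().add("i", IntegerType), new StructColumnStats(_))
// testComplexColumnStats(MapType(IntegerType, StringType), new MapColumnStats(_))
```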
}
private[columnar] final class StructColumnStats(dataType: DataType)
  extends OrderableSafeColumnStats[InternalRow](dataType)
InternalRow -> UnsafeRow? Looks like for struct, the column type is specified for UnsafeRow.
Same question as for the ArrayData above.
}
private[columnar] final class ArrayColumnStats(dataType: DataType)
  extends OrderableSafeColumnStats[ArrayData](dataType)
ArrayData -> UnsafeArrayData?
Should we be scoping it down? The API for InternalRow gives back ArrayData, so we'd need a cast to do so.
test(s"${dataType.typeName}: non-empty") {
  import org.apache.spark.sql.execution.columnar.ColumnarTestUtils._
  val objectStats = new ArrayColumnStats(dataType)
  val rows = Seq.fill(10)(makeRandomRow(columnType)) ++ Seq.fill(10)(makeNullRow(1))
Because we don't reuse the unsafe array/row here, we don't actually exercise the copying in the corresponding column statistics. Can we have the test data reuse the unsafe structures to test the array and struct column statistics?
Yep, will do.
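A sketch of the kind of test being asked for (assertions and helper usage are illustrative; it assumes the stats copy values internally, which is exactly what the test should prove):

```scala
test("Reuse UnsafeArrayData for stats") {
  val stats = new ArrayColumnStats(ArrayType(IntegerType))
  // Mutate a single UnsafeArrayData instance across calls; without a defensive
  // copy inside gatherStats the recorded bounds would all alias the last value.
  val unsafeArray = UnsafeArrayData.fromPrimitiveArray(Array(1))
  (10 to 1 by -1).foreach { i =>
    unsafeArray.setInt(0, i)
    stats.gatherStats(new GenericInternalRow(Array[Any](unsafeArray)), 0)
  }
  val Array(lower, upper, _, _, _) = stats.collectedStatistics
  assert(lower.asInstanceOf[UnsafeArrayData].getInt(0) === 1)
  assert(upper.asInstanceOf[UnsafeArrayData].getInt(0) === 10)
}
```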
Test build #88785 has finished for PR 20935 at commit
  override def copy(value: UnsafeArrayData): UnsafeArrayData = value.copy()
}
private[columnar] final class StructColumnStats(dataType: DataType)
dataType: DataType -> dataType: StructType?
  Array[Any](lower, upper, nullCount, count, sizeInBytes)
}
private[columnar] final class ArrayColumnStats(dataType: DataType)
dataType: DataType -> dataType: ArrayType?
  override def copy(value: UnsafeRow): UnsafeRow = value.copy()
}
private[columnar] final class MapColumnStats(dataType: DataType) extends ColumnStats {
dataType: DataType -> dataType: MapType?
Please add a TODO that we need to make this use OrderableSafeColumnStats when MapType is orderable.
Now that you mention it, we can just have it use OrderableSafeColumnStats now, since MapType will always fall through to the unorderable case. Everything will just work when we make it orderable, without a code change here.
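For reference, a sketch of what that would look like (the getValue/copy hooks are taken from the diff hunks in this PR; their exact signatures are assumptions):

```scala
private[columnar] final class MapColumnStats(dataType: DataType)
    extends OrderableSafeColumnStats[UnsafeMapData](dataType) {
  // MapType is currently not orderable, so the inherited ordering resolves to None
  // and only nullCount/count/sizeInBytes are tracked; once MapType becomes
  // orderable, upper/lower bounds start being recorded with no change here.
  override def getValue(row: InternalRow, ordinal: Int): UnsafeMapData =
    row.getMap(ordinal).asInstanceOf[UnsafeMapData]
  override def copy(value: UnsafeMapData): UnsafeMapData = value.copy()
}
```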
Oh, sounds good to me.
  }
}
test("Reuse UnsafeArrayData for stats") {
We should test against UnsafeRow too.
cc @cloud-fan

Test build #88786 has finished for PR 20935 at commit

Test build #88797 has finished for PR 20935 at commit

@cloud-fan @gatorsmile @kiszk - any thoughts on this PR?

Ping @cloud-fan @gatorsmile @kiszk

Anything else to be done here?

ok to test

I'd like to review this PR after the parquet nested column pruning is merged.

Test build #93060 has finished for PR 20935 at commit
import org.apache.spark.sql.types.{AtomicType, Decimal}
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeArrayData, UnsafeMapData, UnsafeProjection}
import org.apache.spark.sql.catalyst.util.ArrayBasedMapData
import org.apache.spark.sql.types.{AtomicType, DataType, Decimal, IntegerType, MapType, StringType, StructField, StructType}
If it imports more than 5, a wildcard can be used as well per the style guide.
ok to test

can you fix #21882 back since that PR whitelisted the types

will this share the same infra with the parquet nested column pruning?

Test build #100819 has finished for PR 20935 at commit
sizeInBytes += columnType.actualSize(row, ordinal)
count += 1
ordering.foreach { order =>
  val value = getValue(row, ordinal)
nit: Can we move this statement out of the foreach, since it is loop-invariant?
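For example, something along these lines (a sketch against the snippet above; the copy call follows the earlier discussion about unsafe values):

```scala
// getValue does not depend on `order`, so it can be hoisted out of the foreach.
// (If ordering is None this does extra work, which the next suggestion avoids.)
val value = getValue(row, ordinal)
ordering.foreach { order =>
  if (upper == null || order.gt(value, upper)) upper = copy(value)
  if (lower == null || order.lt(value, lower)) lower = copy(value)
}
```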
sizeInBytes += size
sizeInBytes += columnType.actualSize(row, ordinal)
count += 1
ordering.foreach { order =>
Do we have more than one element in ordering? If not, can we write this without foreach? It could achieve better performance.
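Since the Option holds at most one ordering and is fixed per column, one possible alternative (a sketch, not the final code) is to branch on it explicitly instead of calling foreach for every row:

```scala
ordering match {
  case Some(order) =>
    val value = getValue(row, ordinal)
    if (upper == null || order.gt(value, upper)) upper = copy(value)
    if (lower == null || order.lt(value, lower)) lower = copy(value)
  case None => // unorderable type: only nullCount, count and sizeInBytes are tracked
}
```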
kindly ping @pwoody

ping @pwoody

Can one of the admins verify this patch?

gentle ping @pwoody

@pwoody @HyukjinKwon @viirya May I take over this, since he has not responded for a long time?

I think it is fine, as his last response was more than a year ago.

We're closing this PR because it hasn't been updated in a while. This isn't a judgment on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
This PR allows recording of upper/lower bound values in ColumnStats if the data type is orderable.
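A minimal, self-contained sketch of the core idea (simplified; these are not the PR's actual class names, and it omits the defensive copy of unsafe values discussed in the review):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.RowOrdering
import org.apache.spark.sql.catalyst.util.TypeUtils
import org.apache.spark.sql.types.DataType

// Track lower/upper bounds only when an interpreted ordering exists for the type.
class BoundsTracker(dataType: DataType) {
  private val ordering: Option[Ordering[Any]] =
    if (RowOrdering.isOrderable(dataType)) Some(TypeUtils.getInterpretedOrdering(dataType)) else None
  private var lower: Any = null
  private var upper: Any = null

  def gather(row: InternalRow, ordinal: Int): Unit = ordering.foreach { order =>
    val value = row.get(ordinal, dataType)
    if (upper == null || order.gt(value, upper)) upper = value
    if (lower == null || order.lt(value, lower)) lower = value
  }

  def bounds: (Any, Any) = (lower, upper)
}
```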
How was this patch tested?
Added tests to ColumnStatsSuite and InMemoryColumnarQuerySuite.