[SPARK-23819][SQL] Fix InMemoryTableScanExec complex type pruning #20935
Conversation
… to out of date ColumnStats
ok to test

cc @kiszk

Test build #88757 has finished for PR 20935 at commit
ordering.foreach { order =>
  val value = row.get(ordinal, dataType)
  if (upper == null || order.gt(value, upper)) upper = value
  if (lower == null || order.lt(value, lower)) lower = value
For unsafe row and array, don't we need to copy the value? The added test can't catch this because the random rows are all individual instances; during query evaluation, however, it can be the same reused instance of an unsafe row or array.
Yes, thanks for catching this.
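To illustrate the aliasing issue being discussed, here is a standalone sketch (the projection and values are made up for the example; only the buffer-reuse behavior is the point):

```scala
// UnsafeProjection reuses a single UnsafeRow buffer across calls, so retaining the
// returned reference as an upper/lower bound without copy() means the "bound"
// silently changes when the next row is projected.
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection}
import org.apache.spark.sql.types.{DataType, IntegerType}

val project = UnsafeProjection.create(Array[DataType](IntegerType))
val retained = project(new GenericInternalRow(Array[Any](1))) // alias of the reused buffer
val snapshot = retained.copy()                                // independent copy
project(new GenericInternalRow(Array[Any](2)))
// retained.getInt(0) now reads 2, while snapshot.getInt(0) still reads 1
```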
testColumnStats(classOf[DoubleColumnStats], DOUBLE, Array(Double.MaxValue, Double.MinValue, 0))
testColumnStats(classOf[StringColumnStats], STRING, Array(null, null, 0))
testDecimalColumnStats(Array(null, null, 0))
testColumnStats(classOf[BooleanColumnStats], BOOLEAN, Array(true, false, 0, 0, 0))
Those changes to testColumnStats seem unnecessary?
The column statistics have 5 fields in their array, so the zip comparison on the initial stats will drop the final two.
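To spell out the comparison issue (illustrative values; the field order follows the collectedStatistics array discussed in this PR):

```scala
// collectedStatistics is Array(lower, upper, nullCount, count, sizeInBytes).
// zip truncates to the shorter side, so a 3-element expectation silently skips
// the count and sizeInBytes fields.
val collected = Array[Any](null, null, 0, 7, 128)
val expected = Array[Any](null, null, 0)
assert(expected.zip(collected).length == 3) // the final two fields are never compared
```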
  unsafeRow.getMap(0).copy
}
toUnsafeMap(ArrayBasedMapData(
  Map(Random.nextInt() -> UTF8String.fromString(Random.nextString(Random.nextInt(32))))))
Seems the above changes to data generation are unnecessary too?
The ColumnTypes for Map/Struct/Array all end up casting to their unsafe counterparts to get the size for the statistics, so the test data needs to reflect that as well.
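Roughly what that conversion looks like in the test (reconstructed from the diff above; the exact helper signature is an assumption):

```scala
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeMapData, UnsafeProjection}
import org.apache.spark.sql.catalyst.util.MapData
import org.apache.spark.sql.types.{DataType, MapType}

// Project the safe map through an UnsafeProjection so the stats see the same
// UnsafeMapData that the ColumnType sizing path casts to.
def toUnsafeMap(map: MapData, mapType: MapType): UnsafeMapData = {
  val projection = UnsafeProjection.create(Array[DataType](mapType))
  val unsafeRow = projection(new GenericInternalRow(Array[Any](map)))
  unsafeRow.getMap(0).copy()
}
```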
private[columnar] final class ObjectColumnStats(dataType: DataType) extends ColumnStats {
  val columnType = ColumnType(dataType)
private abstract class OrderableSafeColumnStats[T](dataType: DataType) extends ColumnStats {
OrderableObjectColumnStats?
private val ordering = dataType match {
  case x if RowOrdering.isOrderable(dataType) =>
    Option(TypeUtils.getInterpretedOrdering(x))
  case _ => None
Since this class is only for "orderable" types, maybe we don't need an Option here and ordering can just be Ordering[T].
This is for DataTypes that could be orderable, since Arrays and Structs may have child data types that aren't.
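Concretely (an illustrative check, not part of the patch): a struct or array is orderable only when every nested type is, and MapType is not orderable today, so the Option can legitimately be None.

```scala
import org.apache.spark.sql.catalyst.expressions.RowOrdering
import org.apache.spark.sql.types._

RowOrdering.isOrderable(new StructType().add("i", IntegerType))                      // true
RowOrdering.isOrderable(new StructType().add("m", MapType(IntegerType, StringType))) // false -> ordering is None
RowOrdering.isOrderable(ArrayType(MapType(IntegerType, StringType)))                 // false -> ordering is None
```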
  gatherValueStats(value)
} else {
  gatherNullStats
  gatherNullStats()
I don't think the change to gatherNullStats is necessary...
Yeah, this was mostly from the Scala style guide, since the method mutates the backing stats: http://docs.scala-lang.org/style/method-invocation.html#arity-0
I don't have a strong opinion though, so happy to swap it back.
Let's just swap it back to make the diff small.
  }
}
def testStructColumnStats(
Can't we merge testArrayColumnStats, testMapColumnStats and testStructColumnStats?
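One possible shape for a merged helper (names and assertions here are illustrative, not the PR's code): parameterize on the DataType and a ColumnStats factory instead of keeping three near-identical methods.

```scala
def testComplexColumnStats(dataType: DataType, makeStats: DataType => ColumnStats): Unit = {
  test(s"${dataType.typeName}: empty") {
    val stats = makeStats(dataType)
    // initial stats: lower, upper, nullCount, count, sizeInBytes
    assert(stats.collectedStatistics.toSeq === Seq[Any](null, null, 0, 0, 0))
  }
}

// testComplexColumnStats(ArrayType(IntegerType), new ArrayColumnStats(_))
// testComplexColumnStats(new StructType().add("i", IntegerType), new StructColumnStats(_))
// testComplexColumnStats(MapType(IntegerType, StringType), new MapColumnStats(_))
```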
}
private[columnar] final class StructColumnStats(dataType: DataType)
  extends OrderableSafeColumnStats[InternalRow](dataType)
InternalRow -> UnsafeRow? Looks like for struct, the column type is specified for UnsafeRow.
Same question as for the ArrayData above.
}
private[columnar] final class ArrayColumnStats(dataType: DataType)
  extends OrderableSafeColumnStats[ArrayData](dataType)
ArrayData -> UnsafeArrayData?
Should we be scoping it down? The API for InternalRow gives back ArrayData, so we'd need a cast to do so.
test(s"${dataType.typeName}: non-empty") {
  import org.apache.spark.sql.execution.columnar.ColumnarTestUtils._
  val objectStats = new ArrayColumnStats(dataType)
  val rows = Seq.fill(10)(makeRandomRow(columnType)) ++ Seq.fill(10)(makeNullRow(1))
Because we don't reuse the unsafe array/row here, we don't actually exercise the copying in the corresponding column statistics. Can we have the test data reuse the unsafe structures to test the array and struct column statistics?
Yep, will do.
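A sketch of the kind of test being asked for (assertions and helper usage are illustrative; it assumes the stats copy values internally, which is exactly what the test should prove):

```scala
test("Reuse UnsafeArrayData for stats") {
  val stats = new ArrayColumnStats(ArrayType(IntegerType))
  // Mutate a single UnsafeArrayData instance across calls; without a defensive
  // copy inside gatherStats the recorded bounds would all alias the last value.
  val unsafeArray = UnsafeArrayData.fromPrimitiveArray(Array(1))
  (10 to 1 by -1).foreach { i =>
    unsafeArray.setInt(0, i)
    stats.gatherStats(new GenericInternalRow(Array[Any](unsafeArray)), 0)
  }
  val Array(lower, upper, _, _, _) = stats.collectedStatistics
  assert(lower.asInstanceOf[UnsafeArrayData].getInt(0) === 1)
  assert(upper.asInstanceOf[UnsafeArrayData].getInt(0) === 10)
}
```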
Test build #88785 has finished for PR 20935 at commit
  override def copy(value: UnsafeArrayData): UnsafeArrayData = value.copy()
}
private[columnar] final class StructColumnStats(dataType: DataType)
dataType: DataType -> dataType: StructType?
  Array[Any](lower, upper, nullCount, count, sizeInBytes)
}
private[columnar] final class ArrayColumnStats(dataType: DataType)
dataType: DataType -> dataType: ArrayType?
  override def copy(value: UnsafeRow): UnsafeRow = value.copy()
}
private[columnar] final class MapColumnStats(dataType: DataType) extends ColumnStats {
dataType: DataType -> dataType: MapType?
Please add a TODO that we need to make this use OrderableSafeColumnStats when MapType is orderable.
Now that you mention it, we can just have it use OrderableSafeColumnStats now, since MapType will always fall through to the unorderable case. Everything will just work when we make it orderable, without a code change here.
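For reference, a sketch of what that would look like (the getValue/copy hooks are taken from the diff hunks in this PR; their exact signatures are assumptions):

```scala
private[columnar] final class MapColumnStats(dataType: DataType)
    extends OrderableSafeColumnStats[UnsafeMapData](dataType) {
  // MapType is currently not orderable, so the inherited ordering resolves to None
  // and only nullCount/count/sizeInBytes are tracked; once MapType becomes
  // orderable, upper/lower bounds start being recorded with no change here.
  override def getValue(row: InternalRow, ordinal: Int): UnsafeMapData =
    row.getMap(ordinal).asInstanceOf[UnsafeMapData]
  override def copy(value: UnsafeMapData): UnsafeMapData = value.copy()
}
```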
Oh, sounds good to me.
  }
}
test("Reuse UnsafeArrayData for stats") {
We should test against UnsafeRow too.
cc @cloud-fan

Test build #88786 has finished for PR 20935 at commit

Test build #88797 has finished for PR 20935 at commit

@cloud-fan @gatorsmile @kiszk - any thoughts on this PR?

Ping @cloud-fan @gatorsmile @kiszk

Anything else to be done here?

ok to test

I'd like to review this PR after the parquet nested column pruning is merged.

Test build #93060 has finished for PR 20935 at commit
import org.apache.spark.sql.types.{AtomicType, Decimal}
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeArrayData, UnsafeMapData, UnsafeProjection}
import org.apache.spark.sql.catalyst.util.ArrayBasedMapData
import org.apache.spark.sql.types.{AtomicType, DataType, Decimal, IntegerType, MapType, StringType, StructField, StructType}
If it imports more than 5, a wildcard can be used as well per the style guide.
ok to test

can you fix #21882 back since that PR whitelisted the types

will this share the same infra with the parquet nested column pruning?

Test build #100819 has finished for PR 20935 at commit
sizeInBytes += columnType.actualSize(row, ordinal)
count += 1
ordering.foreach { order =>
  val value = getValue(row, ordinal)
nit: Can we move this statement out of the foreach, since it is loop-invariant?
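For example, something along these lines (a sketch against the snippet above; the copy call follows the earlier discussion about unsafe values):

```scala
// getValue does not depend on `order`, so it can be hoisted out of the foreach.
// (If ordering is None this does extra work, which the next suggestion avoids.)
val value = getValue(row, ordinal)
ordering.foreach { order =>
  if (upper == null || order.gt(value, upper)) upper = copy(value)
  if (lower == null || order.lt(value, lower)) lower = copy(value)
}
```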
sizeInBytes += size
sizeInBytes += columnType.actualSize(row, ordinal)
count += 1
ordering.foreach { order =>
Do we have more than one element in ordering? If not, can we write this without foreach? It could achieve better performance.
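Since the Option holds at most one ordering and is fixed per column, one possible alternative (a sketch, not the final code) is to branch on it explicitly instead of calling foreach for every row:

```scala
ordering match {
  case Some(order) =>
    val value = getValue(row, ordinal)
    if (upper == null || order.gt(value, upper)) upper = copy(value)
    if (lower == null || order.lt(value, lower)) lower = copy(value)
  case None => // unorderable type: only nullCount, count and sizeInBytes are tracked
}
```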
kindly ping @pwoody

ping @pwoody

Can one of the admins verify this patch?

gentle ping @pwoody

@pwoody @HyukjinKwon @viirya May I take over this, since he has not responded for a long time?

I think it is fine, as his last response was more than a year ago.

We're closing this PR because it hasn't been updated in a while. This isn't a judgment on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
This PR allows recording of upper/lower bound values in ColumnStats if the data type is orderable.
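A minimal, self-contained sketch of the core idea (simplified; these are not the PR's actual class names, and it omits the defensive copy of unsafe values discussed in the review):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.RowOrdering
import org.apache.spark.sql.catalyst.util.TypeUtils
import org.apache.spark.sql.types.DataType

// Track lower/upper bounds only when an interpreted ordering exists for the type.
class BoundsTracker(dataType: DataType) {
  private val ordering: Option[Ordering[Any]] =
    if (RowOrdering.isOrderable(dataType)) Some(TypeUtils.getInterpretedOrdering(dataType)) else None
  private var lower: Any = null
  private var upper: Any = null

  def gather(row: InternalRow, ordinal: Int): Unit = ordering.foreach { order =>
    val value = row.get(ordinal, dataType)
    if (upper == null || order.gt(value, upper)) upper = value
    if (lower == null || order.lt(value, lower)) lower = value
  }

  def bounds: (Any, Any) = (lower, upper)
}
```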
How was this patch tested?
Added tests to ColumnStatsSuite and InMemoryColumnarQuerySuite.