Skip to content

Conversation

@pwoody
Copy link

@pwoody pwoody commented Mar 29, 2018

What changes were proposed in this pull request?

This PR allows recording of upper/lower bound values in ColumnStats if the data type is orderable.

How was this patch tested?

Added tests to ColumnStatsSuite and InMemoryColumnarQuerySuite.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@gatorsmile
Copy link
Member

ok to test

@gatorsmile
Copy link
Member

cc @kiszk

@SparkQA
Copy link

SparkQA commented Mar 30, 2018

Test build #88757 has finished for PR 20935 at commit 5c95cef.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ordering.foreach { order =>
val value = row.get(ordinal, dataType)
if (upper == null || order.gt(value, upper)) upper = value
if (lower == null || order.lt(value, lower)) lower = value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For unsafe row and array, doesn't we need to copy the value? In the added test this can't be tested because the random rows are all individual instances, however, it can be the same instance of unsafe row or array during query evaluation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks for catching this.

testColumnStats(classOf[DoubleColumnStats], DOUBLE, Array(Double.MaxValue, Double.MinValue, 0))
testColumnStats(classOf[StringColumnStats], STRING, Array(null, null, 0))
testDecimalColumnStats(Array(null, null, 0))
testColumnStats(classOf[BooleanColumnStats], BOOLEAN, Array(true, false, 0, 0, 0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those changes to testColumnStats seems unnecessary?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The column statistics have 5 fields in their array, so the zip comparison on the initial stats will drop the final two.

unsafeRow.getMap(0).copy
}
toUnsafeMap(ArrayBasedMapData(
Map(Random.nextInt() -> UTF8String.fromString(Random.nextString(Random.nextInt(32))))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems above changes to data generation are unnecessary too?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ColumnType for Maps/Struct/Array all end up casting to their Unsafe structures to get the size for the statistics, so the test data will need to reflect that as well.


private[columnar] final class ObjectColumnStats(dataType: DataType) extends ColumnStats {
val columnType = ColumnType(dataType)
private abstract class OrderableSafeColumnStats[T](dataType: DataType) extends ColumnStats {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OrderableObjectColumnStats?

private val ordering = dataType match {
case x if RowOrdering.isOrderable(dataType) =>
Option(TypeUtils.getInterpretedOrdering(x))
case _ => None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this class is only for "orderable", maybe we don't need optional here and ordering can just be Ordering[T].

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for DataTypes that could be orderable since Arrays and Structs may have children data types that aren't.

gatherValueStats(value)
} else {
gatherNullStats
gatherNullStats()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the change to gatherNullStats is necessary...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this was mostly from the scala style guide since this mutates the backing stats. http://docs.scala-lang.org/style/method-invocation.html#arity-0
I don't have a strong opinion though, so happy to swap it back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just swap it back to make the diff small.

}
}

def testStructColumnStats(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we merge testArrayColumnStats, testMapColumnStats and testStructColumnStats?

}

private[columnar] final class StructColumnStats(dataType: DataType)
extends OrderableSafeColumnStats[InternalRow](dataType) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InternalRow -> UnsafeRow? Looks like for struct, the column type is specified for UnsafeRow.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as for the ArrayData above.

}

private[columnar] final class ArrayColumnStats(dataType: DataType)
extends OrderableSafeColumnStats[ArrayData](dataType) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrayData -> UnsafeArrayData?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be scoping it down? The API for InternalRow gives back ArrayData, so we'd need a cast to do so.

test(s"${dataType.typeName}: non-empty") {
import org.apache.spark.sql.execution.columnar.ColumnarTestUtils._
val objectStats = new ArrayColumnStats(dataType)
val rows = Seq.fill(10)(makeRandomRow(columnType)) ++ Seq.fill(10)(makeNullRow(1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we don't reuse the unsafe array/row here, we don't actually test on the copying in corresponding column statistics, can we have the test data reusing the unsafe structures to test array and struct column statistics?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, will do.

@SparkQA
Copy link

SparkQA commented Apr 1, 2018

Test build #88785 has finished for PR 20935 at commit 426374b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

override def copy(value: UnsafeArrayData): UnsafeArrayData = value.copy()
}

private[columnar] final class StructColumnStats(dataType: DataType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataType: DataType -> dataType: StructType?

Array[Any](lower, upper, nullCount, count, sizeInBytes)
}

private[columnar] final class ArrayColumnStats(dataType: DataType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataType: DataType -> dataType: ArrayType?

override def copy(value: UnsafeRow): UnsafeRow = value.copy()
}

private[columnar] final class MapColumnStats(dataType: DataType) extends ColumnStats {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataType: DataType -> dataType: MapType?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a TODO that we need to make this use OrderableSafeColumnStats when MapType is orderable.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you mention it - we can just have it use it now since it will always fall through to the unorderable case. Everything will just work when we make it orderable w/o code change here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sounds good to me.

}
}

test("Reuse UnsafeArrayData for stats") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also test against UnsafeRow too.

@viirya
Copy link
Member

viirya commented Apr 1, 2018

cc @cloud-fan

@SparkQA
Copy link

SparkQA commented Apr 1, 2018

Test build #88786 has finished for PR 20935 at commit 1479bde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 1, 2018

Test build #88797 has finished for PR 20935 at commit 6ea0919.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pwoody
Copy link
Author

pwoody commented Apr 6, 2018

@cloud-fan @gatorsmile @kiszk - any thoughts on this PR?

@pwoody
Copy link
Author

pwoody commented Apr 16, 2018

Ping @cloud-fan @gatorsmile @kiszk

@pwoody
Copy link
Author

pwoody commented May 3, 2018

Anything else to be done here?

@HyukjinKwon
Copy link
Member

ok to test

@cloud-fan
Copy link
Contributor

I'd like to review this PR after the parquet nested column pruning is merged.

@SparkQA
Copy link

SparkQA commented Jul 16, 2018

Test build #93060 has finished for PR 20935 at commit 6ea0919.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.sql.types.{AtomicType, Decimal}
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeArrayData, UnsafeMapData, UnsafeProjection}
import org.apache.spark.sql.catalyst.util.ArrayBasedMapData
import org.apache.spark.sql.types.{AtomicType, DataType, Decimal, IntegerType, MapType, StringType, StructField, StructType}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it imports more then 5, wlidcard can be used as well per style guide.

@HyukjinKwon
Copy link
Member

ok to test

@HyukjinKwon
Copy link
Member

can you fix #21882 back since that PR whitelisted the types

@cloud-fan
Copy link
Contributor

cloud-fan commented Jan 6, 2019

will this share the same infra with the parquet nested column pruning?

@SparkQA
Copy link

SparkQA commented Jan 6, 2019

Test build #100819 has finished for PR 20935 at commit 6ea0919.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

sizeInBytes += columnType.actualSize(row, ordinal)
count += 1
ordering.foreach { order =>
val value = getValue(row, ordinal)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we move this statement out of foreach since this is loop-invariant?

sizeInBytes += size
sizeInBytes += columnType.actualSize(row, ordinal)
count += 1
ordering.foreach { order =>
Copy link
Member

@kiszk kiszk Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have more than one elements in ordering? If not, can we write this without foreach? It could achieve better performance.

@kiszk
Copy link
Member

kiszk commented Feb 24, 2019

kindly ping @pwoody

@HyukjinKwon
Copy link
Member

ping @pwoody

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@kiszk
Copy link
Member

kiszk commented Oct 2, 2019

gentle ping @pwoody

@kiszk
Copy link
Member

kiszk commented Oct 9, 2019

@pwoody @HyukjinKwon @viirya May I take over this since he did not respond for a long time?

@viirya
Copy link
Member

viirya commented Oct 9, 2019

I think it is fine as his last response is more than 1 yr ago.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 18, 2020
@github-actions github-actions bot closed this Jan 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants