Conversation

@davies davies commented Jun 12, 2015

Currently, we use o.a.s.sql.Row both internally and externally. The external interface is wider than what internal code needs, because it is designed to facilitate end-user programming. This design has proven to be error prone and cumbersome for internal Row implementations.

As a first step, we create an InternalRow interface in the catalyst module that is identical to the current Row interface, and we switch all internal operators/expressions to use InternalRow instead. When we need to expose a Row, we convert the InternalRow implementation into a Row for users.

For all public APIs (for example, the data source APIs), we use Row, which is converted to/from InternalRow by CatalystTypeConverters.

For all built-in data sources (JSON, Parquet, JDBC, Hive), we use InternalRow for better performance and cast it to Row in buildScan() (without changing the public API). When creating a PhysicalRDD, we cast the rows back to InternalRow.

cc @rxin @marmbrus @JoshRosen
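
To make the Row/InternalRow boundary described above concrete, here is a minimal, self-contained Scala sketch. It is an illustration only: the InternalRow, Row, GenericInternalRow, GenericRow, and Converters definitions below are simplified stand-ins, not Spark's actual classes; in Spark the conversion is done by CatalystTypeConverters and both interfaces are much richer.

    // Narrow interface used by internal operators/expressions.
    trait InternalRow {
      def numFields: Int
      def get(ordinal: Int): Any
    }

    // Wide, user-facing interface (the real Row has many more accessors).
    trait Row {
      def length: Int
      def apply(i: Int): Any
      def getString(i: Int): String = apply(i).asInstanceOf[String]
    }

    // Simple array-backed implementations.
    final class GenericInternalRow(values: Array[Any]) extends InternalRow {
      def numFields: Int = values.length
      def get(ordinal: Int): Any = values(ordinal)
    }

    final class GenericRow(values: Array[Any]) extends Row {
      def length: Int = values.length
      def apply(i: Int): Any = values(i)
    }

    // Stand-in for CatalystTypeConverters: convert only at the public boundary.
    object Converters {
      def toScalaRow(row: InternalRow): Row =
        new GenericRow(Array.tabulate(row.numFields)(row.get))

      def toInternalRow(row: Row): InternalRow =
        new GenericInternalRow(Array.tabulate(row.length)(row.apply))
    }

    object BoundaryDemo extends App {
      val internal: InternalRow = new GenericInternalRow(Array("spark", 1))
      val external: Row = Converters.toScalaRow(internal)             // handed to users
      val roundTrip: InternalRow = Converters.toInternalRow(external) // e.g. when building a PhysicalRDD
      println(external.getString(0) + ", " + roundTrip.get(1))
    }

The point of the split is that internal code only ever sees the narrow interface, and the conversion to the wide, user-facing Row happens once, at the edge.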

@marmbrus

Thanks for working on this huge change! Should InternalRow live in catalyst with the other semi-private APIs?

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala
	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateUtilsSuite.scala
	sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/pythonUdfs.scala
	sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala
	sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala
	sql/core/src/test/scala/org/apache/spark/sql/columnar/ColumnStatsSuite.scala
	sql/core/src/test/scala/org/apache/spark/sql/columnar/ColumnarTestUtils.scala
@rxin rxin commented Jun 13, 2015

How come Jenkins didn't print any meaningful messages? cc @JoshRosen

Contributor

why not just import InternalRow?

Contributor Author

This is done by IntelliJ.

Contributor

Okay, but we aren't even consistent in our usage. Can we remove this and just reference InternalRow?

Contributor

+1, I think we should fix this.

Contributor Author

Will do in a follow-up PR. Please continue reviewing this.
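
For readers skimming this inline thread: the nit is only about how the new type is referenced at use sites, fully qualified paths that the IDE generated versus a single import. The actual diff hunk is not shown here, so the following Scala sketch just illustrates the two styles; ProjectionA and ProjectionB are made-up names, not classes from this patch.

    // Fully qualified at every use site (roughly what IntelliJ tends to generate).
    class ProjectionA {
      def emptyResult: org.apache.spark.sql.catalyst.InternalRow = ???
    }

    // Import once and use the short name consistently (what the review asks for).
    import org.apache.spark.sql.catalyst.InternalRow

    class ProjectionB {
      def emptyResult: InternalRow = ???
    }

Both compile; the request is simply to pick the import style and use it consistently across the codebase.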

@JoshRosen

Jenkins, retest this please.

@JoshRosen

I looked at the Jenkins configuration log and it looks like @andrewor14's credentials somehow auto-filled and overwrote the Jenkins GitHub token; Andrew and I were modifying the builder configurations this afternoon to attach unit-tests.log outputs to the builds as build artifacts (we'll email the dev list later with more details on this feature).

I've rolled back the configuration change so hopefully we'll see SparkQA posting soon.

For those with Jenkins admin access, see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/jobConfigHistory/showDiffFiles?timestamp1=2015-06-12_09-30-32&timestamp2=2015-06-12_14-37-42

@SparkQA SparkQA commented Jun 13, 2015

Test build #34816 has finished for PR 6792 at commit f2abd13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin rxin commented Jun 13, 2015

Going to merge this quickly since it conflicts with a lot of other patches.

@asfgit asfgit closed this in d46f8e5 Jun 13, 2015
Contributor

Is this cast needed?

Contributor Author

Not needed, will remove it.

@marmbrus

Another nit:

[warn] /home/michael/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala:24: imported `InternalRow' is permanently hidden by definition of object InternalRow in package catalyst
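
The warning means the import in ScalaReflectionSuite is redundant: the file already sits in package org.apache.spark.sql.catalyst, where object InternalRow is now defined, so that name is in scope without the import and the explicit import can never take effect. Here is a self-contained sketch of the same situation with made-up names; depending on the Scala compiler version, compiling this single file should produce the same kind of warning.

    package demo {
      object InternalThing {
        val answer = 42
      }
    }

    package demo {
      // Redundant: InternalThing is already visible as a member of package demo,
      // so scalac reports the import as permanently hidden.
      import demo.InternalThing

      object Consumer {
        def main(args: Array[String]): Unit = println(InternalThing.answer)
      }
    }

The fix is simply to drop the redundant import.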

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015

Author: Davies Liu <[email protected]>

Closes apache#6792 from davies/internal_row and squashes the following commits:

f2abd13 [Davies Liu] fix scalastyle
a7e025c [Davies Liu] move InternalRow into catalyst
30db8ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into internal_row
7cbced8 [Davies Liu] separate Row and InternalRow