Conversation

@davies davies commented Jun 12, 2015

Currently, we use o.a.s.sql.Row both internally and externally. The external interface is wider than what internal code needs, because it is designed to facilitate end-user programming. This design has proven to be error prone and cumbersome for internal Row implementations.

As a first step, we create an InternalRow interface in the catalyst module that is identical to the current Row interface, and we switch all internal operators/expressions to use InternalRow instead. When we need to expose a Row, we convert the InternalRow implementation into a Row for users.

For all public APIs (for example, the data source APIs), we use Row, which is converted to/from InternalRow by CatalystTypeConverters.

For all built-in data sources (JSON, Parquet, JDBC, Hive), we use InternalRow for better performance and cast it to Row in buildScan() (without changing the public API). When creating a PhysicalRDD, we cast the rows back to InternalRow.

cc @rxin @marmbrus @JoshRosen
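
To make the Row/InternalRow boundary described above concrete, here is a minimal, self-contained Scala sketch. It is an illustration only: the InternalRow, Row, GenericInternalRow, GenericRow, and Converters definitions below are simplified stand-ins, not Spark's actual classes; in Spark the conversion is done by CatalystTypeConverters and both interfaces are much richer.

    // Narrow interface used by internal operators/expressions.
    trait InternalRow {
      def numFields: Int
      def get(ordinal: Int): Any
    }

    // Wide, user-facing interface (the real Row has many more accessors).
    trait Row {
      def length: Int
      def apply(i: Int): Any
      def getString(i: Int): String = apply(i).asInstanceOf[String]
    }

    // Simple array-backed implementations.
    final class GenericInternalRow(values: Array[Any]) extends InternalRow {
      def numFields: Int = values.length
      def get(ordinal: Int): Any = values(ordinal)
    }

    final class GenericRow(values: Array[Any]) extends Row {
      def length: Int = values.length
      def apply(i: Int): Any = values(i)
    }

    // Stand-in for CatalystTypeConverters: convert only at the public boundary.
    object Converters {
      def toScalaRow(row: InternalRow): Row =
        new GenericRow(Array.tabulate(row.numFields)(row.get))

      def toInternalRow(row: Row): InternalRow =
        new GenericInternalRow(Array.tabulate(row.length)(row.apply))
    }

    object BoundaryDemo extends App {
      val internal: InternalRow = new GenericInternalRow(Array("spark", 1))
      val external: Row = Converters.toScalaRow(internal)             // handed to users
      val roundTrip: InternalRow = Converters.toInternalRow(external) // e.g. when building a PhysicalRDD
      println(external.getString(0) + ", " + roundTrip.get(1))
    }

The point of the split is that internal code only ever sees the narrow interface, and the conversion to the wide, user-facing Row happens once, at the edge.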

@marmbrus

Thanks for working on this huge change! Should InternalRow live in catalyst with the other semi-private APIs?

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala
	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateUtilsSuite.scala
	sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/pythonUdfs.scala
	sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala
	sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala
	sql/core/src/test/scala/org/apache/spark/sql/columnar/ColumnStatsSuite.scala
	sql/core/src/test/scala/org/apache/spark/sql/columnar/ColumnarTestUtils.scala
@rxin rxin commented Jun 13, 2015

How come Jenkins didn't print any meaningful messages? cc @JoshRosen

Contributor

why not just import InternalRow?

Contributor Author

This is done by IntelliJ.

Contributor

Okay, but we aren't even consistent in our usage. Can we remove this and just reference InternalRow?

Contributor

+1, I think we should fix this.

Contributor Author

Will do in a follow-up PR. Please continue reviewing this.
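
For readers skimming this inline thread: the nit is only about how the new type is referenced at use sites, fully qualified paths that the IDE generated versus a single import. The actual diff hunk is not shown here, so the following Scala sketch just illustrates the two styles; ProjectionA and ProjectionB are made-up names, not classes from this patch.

    // Fully qualified at every use site (roughly what IntelliJ tends to generate).
    class ProjectionA {
      def emptyResult: org.apache.spark.sql.catalyst.InternalRow = ???
    }

    // Import once and use the short name consistently (what the review asks for).
    import org.apache.spark.sql.catalyst.InternalRow

    class ProjectionB {
      def emptyResult: InternalRow = ???
    }

Both compile; the request is simply to pick the import style and use it consistently across the codebase.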

@JoshRosen

Jenkins, retest this please.

@JoshRosen

I looked at the Jenkins configuration log and it looks like @andrewor14's credentials somehow auto-filled and overwrote the Jenkins GitHub token; Andrew and I were modifying the builder configurations this afternoon to attach unit-tests.log outputs to the builds as build artifacts (we'll email the dev list later with more details on this feature).

I've rolled back the configuration change so hopefully we'll see SparkQA posting soon.

For those with Jenkins admin access, see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/jobConfigHistory/showDiffFiles?timestamp1=2015-06-12_09-30-32&timestamp2=2015-06-12_14-37-42

@SparkQA SparkQA commented Jun 13, 2015

Test build #34816 has finished for PR 6792 at commit f2abd13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin rxin commented Jun 13, 2015

Going to merge this quickly since it conflicts with a lot of other patches.

@asfgit asfgit closed this in d46f8e5 Jun 13, 2015
Contributor

Is this cast needed?

Contributor Author

Not needed, will remove it.

@marmbrus

Another nit:

[warn] /home/michael/spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala:24: imported `InternalRow' is permanently hidden by definition of object InternalRow in package catalyst
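
The warning means the import in ScalaReflectionSuite is redundant: the file already sits in package org.apache.spark.sql.catalyst, where object InternalRow is now defined, so that name is in scope without the import and the explicit import can never take effect. Here is a self-contained sketch of the same situation with made-up names; depending on the Scala compiler version, compiling this single file should produce the same kind of warning.

    package demo {
      object InternalThing {
        val answer = 42
      }
    }

    package demo {
      // Redundant: InternalThing is already visible as a member of package demo,
      // so scalac reports the import as permanently hidden.
      import demo.InternalThing

      object Consumer {
        def main(args: Array[String]): Unit = println(InternalThing.answer)
      }
    }

The fix is simply to drop the redundant import.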

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015

Author: Davies Liu <[email protected]>

Closes apache#6792 from davies/internal_row and squashes the following commits:

f2abd13 [Davies Liu] fix scalastyle
a7e025c [Davies Liu] move InternalRow into catalyst
30db8ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into internal_row
7cbced8 [Davies Liu] separate Row and InternalRow