-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-7186] [SQL] Decouple internal Row from external Row #6792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for working on this huge change! Should |
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateUtilsSuite.scala sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala sql/core/src/main/scala/org/apache/spark/sql/execution/pythonUdfs.scala sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala sql/core/src/test/scala/org/apache/spark/sql/columnar/ColumnStatsSuite.scala sql/core/src/test/scala/org/apache/spark/sql/columnar/ColumnarTestUtils.scala
|
How come Jenkins didn't print any meaning messages? cc @JoshRosen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just import InternalRow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done by Intellj.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, but we aren't even consistent in our usage. Can we remove this and just reference InternalRow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 i think we should fix this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do in a follow up PR. Please continue review this.
|
Jenkins, retest this please. |
|
I looked at the Jenkins configuration log and it looks like @andrewor14's credentials somehow auto-filled and overwrote the Jenkins GitHub token; Andrew and I were modifying the builder configurations this afternoon to attach I've rolled back the configuration change so hopefully we'll see SparkQA posting soon. For those with Jenkins admin access, see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/jobConfigHistory/showDiffFiles?timestamp1=2015-06-12_09-30-32×tamp2=2015-06-12_14-37-42 |
|
Test build #34816 has finished for PR 6792 at commit
|
|
Going to merge this quickly since it conflicts with a lot of other patches. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this cast needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed, will remove it.
|
Another nit: |
Currently, we use o.a.s.sql.Row both internally and externally. The external interface is wider than what the internal needs because it is designed to facilitate end-user programming. This design has proven to be very error prone and cumbersome for internal Row implementations. As a first step, we create an InternalRow interface in the catalyst module, which is identical to the current Row interface. And we switch all internal operators/expressions to use this InternalRow instead. When we need to expose Row, we convert the InternalRow implementation into Row for users. For all public API, we use Row (for example, data source APIs), which will be converted into/from InternalRow by CatalystTypeConverters. For all internal data sources (Json, Parquet, JDBC, Hive), we use InternalRow for better performance, casted into Row in buildScan() (without change the public API). When create a PhysicalRDD, we cast them back to InternalRow. cc rxin marmbrus JoshRosen Author: Davies Liu <[email protected]> Closes apache#6792 from davies/internal_row and squashes the following commits: f2abd13 [Davies Liu] fix scalastyle a7e025c [Davies Liu] move InternalRow into catalyst 30db8ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into internal_row 7cbced8 [Davies Liu] separate Row and InternalRow
Currently, we use o.a.s.sql.Row both internally and externally. The external interface is wider than what the internal needs because it is designed to facilitate end-user programming. This design has proven to be very error prone and cumbersome for internal Row implementations.
As a first step, we create an InternalRow interface in the catalyst module, which is identical to the current Row interface. And we switch all internal operators/expressions to use this InternalRow instead. When we need to expose Row, we convert the InternalRow implementation into Row for users.
For all public API, we use Row (for example, data source APIs), which will be converted into/from InternalRow by CatalystTypeConverters.
For all internal data sources (Json, Parquet, JDBC, Hive), we use InternalRow for better performance, casted into Row in buildScan() (without change the public API). When create a PhysicalRDD, we cast them back to InternalRow.
cc @rxin @marmbrus @JoshRosen