[SPARK-12010][SQL] Spark JDBC requires support for column-name-free INSERT syntax #10066
Status: Closed

Commits (9):
- f2bf6ee Initial version
- d176b37 Style changes and renamed from ProgressCassandraDialect to CassandraD…
- dd44460 Clean style check run using Maven invocation
- 024475a Initial version
- 95b2538 Style changes and renamed from ProgressCassandraDialect to CassandraD…
- be3e31c Clean style check run using Maven invocation
- 5a2ac6f Merge branch 'master-SPARK-12010' of https://github.com/CK50/spark in…
- c63696d Merge branch 'master' into master-SPARK-12010
- cac0536 Added support for columnMappings to fix SPARK-012010

(all commits by CK50)
sql/core/src/main/scala/org/apache/spark/sql/jdbc/CassandraDialect.scala (new file, 53 additions, 0 deletions)

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.jdbc

import java.sql.Types

import org.apache.spark.sql.types._

private case object CassandraDialect extends JdbcDialect {

  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:datadirect:cassandra") ||
    url.startsWith("jdbc:weblogic:cassandra")

  override def getInsertStatement(table: String, rddSchema: StructType): String = {
    val sql = new StringBuilder(s"INSERT INTO $table ( ")
    var fieldsLeft = rddSchema.fields.length
    var i = 0
    // Build list of column names
    while (fieldsLeft > 0) {
      sql.append(rddSchema.fields(i).name)
      if (fieldsLeft > 1) sql.append(", ")
      fieldsLeft = fieldsLeft - 1
      i = i + 1
    }
    sql.append(" ) VALUES ( ")
    // Build values clause
    fieldsLeft = rddSchema.fields.length
    while (fieldsLeft > 0) {
      sql.append("?")
      if (fieldsLeft > 1) sql.append(", ")
      fieldsLeft = fieldsLeft - 1
    }
    sql.append(" ) ")
    return sql.toString()
  }
}
```
Sean Owen:

Nits: braces and a newline for the `if`; use `a += 1` instead of `a = a + 1` in the two lines below; there is an extra blank line near the imports.

You should probably also make this more idiomatic and compact. For example this `while` loop collapses to `rddSchema.fields.map(_.name).mkString(", ")`, I believe. Similarly for the final `while`. Then the entire method doesn't need to build the string up manually with a `StringBuilder`. This is probably a couple of lines of code.
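The collapse suggested above can be sketched as follows — a minimal, self-contained version in which a plain `Seq[String]` of column names stands in for `rddSchema.fields.map(_.name)` (the object name `InsertSql` is illustrative, not part of Spark):

```scala
object InsertSql {
  // Build "INSERT INTO t (a, b) VALUES (?, ?)" from a table name and column names,
  // replacing the two while loops and the StringBuilder with mkString.
  def insertStatement(table: String, columns: Seq[String]): String = {
    val cols = columns.mkString(", ")
    val placeholders = columns.map(_ => "?").mkString(", ")
    s"INSERT INTO $table ($cols) VALUES ($placeholders)"
  }
}

println(InsertSql.insertStatement("users", Seq("id", "name", "email")))
// INSERT INTO users (id, name, email) VALUES (?, ?, ?)
```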
CK50 (Christian Kurz):

@sean: I am working on your suggested changes. It looks like the code can be collapsed into one or two lines.

Meanwhile I have found that inserting into an Oracle table that has more columns than the DataFrame results in `java.sql.BatchUpdateException: ORA-00947: not enough values` whenever there are unmapped columns. This does not matter as long as the table exactly matches the DataFrame, but as soon as someone wants to insert into an existing table with more columns than the DataFrame has, it is a problem. So it may indeed be better to include the suggested change for other technologies as well.

The key questions I see are: Is it okay to rely on the DataFrame column names matching the target table column names? If so, do you suggest changing the default behaviour to include column names for all dialects? And do Spark's automated tests have coverage for different databases, i.e. would a regression be caught prior to merge?

By the way, re squashing commits: I will try, but for now I need to better understand how all of this works in GitHub.
On 01.12.2015 13:17, Sean Owen wrote:
Christian Kurz | Consulting Member of Technical Staff, Oracle Product Development
Sean Owen:

You're just saying that inserting a DataFrame of m columns into a table of n > m columns doesn't work, right? Yes, without column name mappings I expect this to fail any time m != n, for any database. Right now this implicitly assumes m = n.

You're right that adding names requires a mapping from DataFrame column names to DB column names. Hm, I wonder if this needs an optional `Map` allowing for overrides. I don't think the regression tests cover all databases, no. I also don't think this can be specific to Oracle anyway.

My workflow for squashing the last N commits is `git rebase -i HEAD~N` followed by `git push --force origin [your branch]`.
CK50 (Christian Kurz):

Yes, this is exactly the problem encountered.

What if we drop the new dialect and instead add a new signature on DataFrameWriter (with a new columnMapping parameter):

def jdbc(url: String, table: String, connectionProperties: Properties, columnMapping: Map[String, String]): Unit

The old signature then continues using the column-name-free INSERT syntax, but for any advanced use cases (or technologies that do not support the column-name-free syntax) the new API can be used. This ensures full backwards compatibility for all technologies. If this is the way to go, should I start a new PR?

My preference would still be to keep the refactoring that moves generation of the INSERT statement into the dialect (instead of in JDBCUtils). Does this make sense?
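A hedged sketch of how such a columnMapping parameter might be applied when generating the INSERT statement. The overload itself is only proposed in this thread, so `mapColumns` and the fall-back-to-DataFrame-name behaviour are illustrative assumptions, not Spark API:

```scala
object ColumnMapping {
  // Rename DataFrame columns to target-table columns; columns without an
  // explicit mapping keep their DataFrame name (assumed fallback behaviour).
  def mapColumns(dfColumns: Seq[String], mapping: Map[String, String]): Seq[String] =
    dfColumns.map(c => mapping.getOrElse(c, c))
}

println(ColumnMapping.mapColumns(Seq("user_id", "name"), Map("user_id" -> "id")))
// List(id, name)
```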
Sean Owen:

@rxin, are you the best person to ask about overloading `DataFrameWriter.jdbc()`?

Interesting question about maintaining the current behavior when no column name mapping is specified. In a way it still seems suboptimal to allow this behavior. What if there are the same number of columns, all of the same types, but the ordering is different? You'd silently insert the wrong data into the wrong columns.

Although specifying the DataFrame-to-table column name mapping can be optional (or the caller can override only the names they want to), I think the SQL statement should be explicit. It does mean that someone who has a DataFrame with differently-named columns might now encounter an exception, but I wonder if that's actually the right thing to enforce going forward: if it doesn't match up by name, don't proceed.

The API changes will take some care to make sure they are unintrusive and backwards compatible. I suspect it doesn't do much harm to keep the insert statement logic in `JdbcDialects`, though I imagine this behavior, whatever we decide, will be the right thing for all dialects, so it can be a default implementation there.
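A minimal sketch of what such a default could look like in a dialect base class. The `Dialect` trait here is a simplified stand-in for `org.apache.spark.sql.jdbc.JdbcDialect`, and the `Seq[String]` parameter stands in for the `StructType` schema; both simplifications are assumptions for illustration:

```scala
trait Dialect {
  // Default: always name the columns, so a DataFrame/table mismatch fails
  // fast by column name instead of silently inserting into wrong columns.
  def getInsertStatement(table: String, columns: Seq[String]): String =
    s"INSERT INTO $table (${columns.mkString(", ")}) " +
      s"VALUES (${columns.map(_ => "?").mkString(", ")})"
}

// A dialect only overrides when its SQL differs (hypothetical example):
object DefaultDialect extends Dialect
```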
I would just go ahead and add the new overload of `jdbc()`.
CK50 (Christian Kurz):

OK, will provide a PR shortly.
CK50 (Christian Kurz):

Opened PR #10312. Please let me know your thoughts. Thanks, Christian