
Conversation

@CK50 CK50 commented Nov 26, 2015

This PR moves the generation of the SQL INSERT statement into the JdbcDialect and allows dialects to add column names based on the RDD column names. The default behaviour stays unchanged; only the new ProgressCassandraDialect adds column names to the generated INSERT statement.

I was wondering whether the logic for inserting the column names should be placed in some kind of generic Dialect instead of into the very specific ProgressCassandraDialect. Any suggestions welcome.
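For readers following along, here is a minimal sketch of the kind of dialect hook being described. The class name `CassandraDialect` and the method name `insertStatement` are illustrative only, not the exact API introduced by this patch:

```scala
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.StructType

// Illustrative dialect that contributes explicit column names to the
// generated INSERT statement (CQL requires them).
case object CassandraDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:cassandra")

  // Hypothetical hook: build the INSERT from the table name and the
  // schema instead of relying on column positions.
  def insertStatement(table: String, schema: StructType): String = {
    val columns = schema.fields.map(_.name).mkString(", ")
    val placeholders = schema.fields.map(_ => "?").mkString(", ")
    s"INSERT INTO $table ($columns) VALUES ($placeholders)"
  }
}
```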

SparkQA commented Nov 26, 2015

Test build #2119 has finished for PR 10003 at commit f2bf6ee.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Reynold Xin (Contributor) commented on the diff:

what does progress mean here?

CK50 (Contributor, Author) replied:

Hi Reynold,
Progress is a software vendor. They are/will be offering a JDBC driver for Cassandra.
Thanks,
Christian

On 26.11.2015 at 20:10, Reynold Xin [email protected] wrote:

In sql/core/src/main/scala/org/apache/spark/sql/jdbc/ProgressCassandraDialect.scala:

+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.jdbc
+
+import java.sql.Types
+
+import org.apache.spark.sql.types._
+
+
+private case object ProgressCassandraDialect extends JdbcDialect {

what does progress mean here?



Contributor commented:

Are there other Cassandra jdbc drivers and is this the primary one people use? I'm thinking we should just drop "Progress" here if this is the primary driver people use.

CK50 (Contributor, Author) replied:

Yes, there are a bunch of other Cassandra JDBC drivers out there; Progress is one of them. But given that the Cassandra SQL syntax mandates the list of column names in the INSERT statement, it is likely that other drivers will require the same syntax. So I will remove "Progress" from the class name.

rxin commented Nov 26, 2015

@CK50 this needs some style cleanup - you can run dev/lint-scala to check styles locally.

Can you please also attach the generated query before / after this change? IIUC, maybe it's ok to have this for all dialects.

dilipbiswal and others added 12 commits November 26, 2015 11:31
…of aliases and real columns

This is based on #9844, with some bug fixes and cleanup.

The problem is that a normal operator should be resolved based on its child, but the `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences` (if the grandchild is `Project`) and `ResolveAggregateFunctions` (if the grandchild is `Aggregate`).
For example, for `select c1 as a, c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. First `a` will be resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we will try to resolve both `a` and `c2` based on its grandchild, but fail because `a` is not a legal aggregate expression.

Whoever merges this PR, please give the credit to dilipbiswal.

Author: Dilip Biswal <[email protected]>
Author: Wenchen Fan <[email protected]>

Closes #9961 from cloud-fan/sort.
…from maven, we need to try to download the version that is used by Spark

If we need to download Hive/Hadoop artifacts, try to download a Hadoop version that matches the Hadoop used by Spark. If the Hadoop artifact cannot be resolved (e.g. the Hadoop version is a vendor-specific version like 2.0.0-cdh4.1.1), we will use Hadoop 2.4.0 (we used to hard-code this version as the Hadoop that we download from Maven) and we will not share Hadoop classes.

I tested this matching on my laptop with the following confs (these confs are used by our builds). All tests are good.
```
build/sbt -Phadoop-1 -Dhadoop.version=1.2.1 -Pkinesis-asl -Phive-thriftserver -Phive
build/sbt -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Pkinesis-asl -Phive-thriftserver -Phive
build/sbt -Pyarn -Phadoop-2.2 -Pkinesis-asl -Phive-thriftserver -Phive
build/sbt -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl -Phive-thriftserver -Phive
```

Author: Yin Huai <[email protected]>

Closes #9979 from yhuai/versionsSuite.
This is a followup for #9959.

I added more documentation and rewrote some monadic code into simpler ifs.

Author: Reynold Xin <[email protected]>

Closes #9995 from rxin/SPARK-11973.
Author: muxator <[email protected]>

Closes #10008 from muxator/patch-1.
In the previous implementation, the driver needs to know the executor listening address to send the thread dump request. However, in Netty RPC, the executor doesn't listen to any port, so the executor thread dump feature is broken.

This patch makes the driver use the endpointRef stored in BlockManagerMasterEndpoint to send the thread dump request to fix it.

Author: Shixiong Zhu <[email protected]>

Closes #9976 from zsxwing/executor-thread-dump.
…rguments

Spark SQL aggregate function:
```Java
stddev
stddev_pop
stddev_samp
variance
var_pop
var_samp
skewness
kurtosis
collect_list
collect_set
```
should support ```columnName``` as arguments like other aggregate functions (max/min/count/sum); a brief usage sketch follows this note.

Author: Yanbo Liang <[email protected]>

Closes #9994 from yanboliang/SPARK-12011.
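A brief usage sketch of the column-name overloads described in the note above; `df` and its "value" column are assumed to exist for illustration:

```scala
import org.apache.spark.sql.functions._

// With this change, the aggregates accept a column name (String),
// just like max/min/count/sum already do.
val summary = df.agg(
  stddev("value"),
  var_pop("value"),
  skewness("value"),
  kurtosis("value"))
summary.show()
```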
Reference: https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor
In order for PostgreSQL to honor the fetchSize non-zero setting, its Connection.autoCommit needs to be set to false. Otherwise, it will just quietly ignore the fetchSize setting.

This adds a new side-effecting, dialect-specific beforeFetch method that fires before a select query is run (a plain-JDBC sketch of this behaviour follows this note).

Author: mariusvniekerk <[email protected]>

Closes #9861 from mariusvniekerk/SPARK-11881.
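A plain-JDBC sketch of the PostgreSQL behaviour described above (the connection URL, credentials, and table name are placeholders); the Spark change wires the equivalent of the autoCommit step into a dialect-level hook that runs before the fetch:

```scala
import java.sql.DriverManager

// Placeholders for illustration only.
val conn = DriverManager.getConnection("jdbc:postgresql://localhost/db", "user", "pass")
try {
  // Without this, PostgreSQL silently ignores setFetchSize and
  // materializes the whole result set on the client.
  conn.setAutoCommit(false)
  val stmt = conn.createStatement()
  stmt.setFetchSize(1000) // stream rows in batches of 1000
  val rs = stmt.executeQuery("SELECT * FROM big_table")
  while (rs.next()) {
    // process the current row
  }
} finally {
  conn.close()
}
```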
Fix regression test for SPARK-11778.
 marmbrus
Could you please take a look?
Thank you very much!!

Author: Huaxin Gao <[email protected]>

Closes #9890 from huaxingao/spark-11778-regression-test.
If `--private-ips` is required but not provided, spark_ec2.py may behave inappropriately, including attempting to ssh to localhost in attempts to verify ssh connectivity to the cluster.

This fixes that behavior by raising a `UsageError` exception if `get_dns_name` is unable to determine a hostname as a result.

Author: Jeremy Derr <[email protected]>

Closes #9975 from jcderr/SPARK-11991/ec_spark.py_hostname_check.
…ned by long column

Check for partition column null-ability while building the partition spec.

Author: Dilip Biswal <[email protected]>

Closes #10001 from dilipbiswal/spark-11997.
srowen commented Nov 27, 2015

This is an orthogonal issue, but it occurs to me that we have a bunch of dialects and a bunch of behavior variations, yet a variation typically isn't specific to a single dialect. How do we let a new variation apply to N dialects at once? This may become a problem soon and need some refactoring.

CK50 commented Nov 27, 2015

@rxin
I wish I could run dev/lint-scala, but even after hours of struggling I cannot get build/sbt running. I have downloaded sbt-launcher.jar manually, but now I am stuck with sbt trying to download extra files from a variety of invalid URLs such as
https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.pom
https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.jar
Any ideas appreciated. The Spark 1.6 build page did not help. - Any other page I should follow?

The old generated statement is

INSERT INTO mytable VALUES (?, ?, ..., ?)

whereas the new statement is

INSERT INTO mytable (col1, col2, ..., colN) VALUES (?, ?, ..., ?)

So the old syntax relies on the column positions, whereas the new one relies on the column names.

Column names are taken from the DataFrame, so DataFrame column names must match the target table column names.

At least for Oracle, adding the column names is fine. The benefit of this change is that you can write out DataFrames with fewer columns than the target table, which I think is not possible today. The downside is that DataFrame column names must match. For best backwards compatibility I only wanted to provide column names when they are really needed, as in the case of Cassandra.
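To make the trade-off concrete, here is a hypothetical usage sketch of the benefit mentioned above (the URL, table name, and `connectionProperties` are placeholders; `df` is assumed to have columns col1 and col3 that match the target table):

```scala
import java.util.Properties

val connectionProperties = new Properties() // driver, user, password, etc.

// The target table has more columns than the DataFrame. With explicit
// column names the generated statement becomes
//   INSERT INTO mytable (col1, col3) VALUES (?, ?)
// whereas the positional form would require all columns in order.
df.select("col1", "col3")
  .write
  .mode("append")
  .jdbc("jdbc:oracle:thin:@//dbhost:1521/service", "mytable", connectionProperties)
```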

WDYT?

On 26.11.2015 20:11, Reynold Xin wrote:

@CK50 https://github.com/CK50 this needs some style cleanup - you can run dev/lint-scala to check styles locally.

Can you please also attach the generated query before / after this change? IIUC, maybe it's ok to have this for all dialects.



srowen commented Nov 27, 2015

I'm not sure what the issue is with SBT (do you get download errors or something?) but you can also see the results of the style checker in the build output: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2119/consoleFull
Search for "[error]"

I tend to think it's better practice to specify column names explicitly. Is there a downside? I don't recall a database that wouldn't let you specify them, right?

CK50 commented Nov 27, 2015

Do I take this as "SBT works fine for me on master"?

Yes, build/sbt fails for me when doing downloads:

Getting org.scala-sbt sbt 0.13.7 ...

:: problems summary ::
:::: WARNINGS
module not found: org.scala-sbt#sbt;0.13.7
==== local: tried
/home/ckurz/.ivy2/local/org.scala-sbt/sbt/0.13.7/ivys/ivy.xml
-- artifact org.scala-sbt#sbt;0.13.7!sbt.jar:
/home/ckurz/.ivy2/local/org.scala-sbt/sbt/0.13.7/jars/sbt.jar
==== jcenter: tried
https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.pom
-- artifact org.scala-sbt#sbt;0.13.7!sbt.jar:
https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.jar
==== typesafe-ivy-releases: tried
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt/0.13.7/ivys/ivy.xml
==== Maven Central: tried
https://repo1.maven.org/maven2/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.pom
-- artifact org.scala-sbt#sbt;0.13.7!sbt.jar:
https://repo1.maven.org/maven2/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.scala-sbt#sbt;0.13.7: not found
::::::::::::::::::::::::::::::::::::::::::::::

:::: ERRORS
Server access Error: Connection timed out
url=https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.pom
Server access Error: Connection timed out
url=https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.jar
Server access Error: Connection timed out
url=https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt/0.13.7/ivys/ivy.xml
Server access Error: Connection timed out
url=https://repo1.maven.org/maven2/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.pom
Server access Error: Connection timed out
url=https://repo1.maven.org/maven2/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.jar

I am behind a proxy and have tried configuring Linux env vars and JAVA_OPTS/SBT_OPTS, but this did not help.

What strikes me is that when I try accessing these URLs, e.g. https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.pom, from within a browser, I also get a 404. When looking into https://jcenter.bintray.com/org/scala-sbt there is no sub-folder called sbt. - No idea what is going wrong here :-(


Back to the PR:
I will upload the suggested style changes.

In general I agree that column names should be given by default. I just have a few concerns about making this the new default behaviour:

  1. Does this work on all supported DBs? - Does Spark have automated tests with appropriate envs for this?
  2. Is it okay to assume that DataFrame column names always match target column names? (Spark currently does not rely on this. Instead it relies on column positions, which is usually not good practice.)
  3. Is such a change okay with Spark's rules for backwards compatibility?

Just some concerns; I am fine with making column names the new default behaviour. But as I said, I do not have the background to judge this.

Please let me know how to proceed.

On 27.11.2015 19:49, Sean Owen wrote:

I'm not sure what the issue is with SBT (do you get download errors or something?) but you can also see the results of the style checker in the build output:
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2119/consoleFull
Search for "[error]"

I tend to think it's better practice to specify column names explicitly. Is there a downside? I don't recall a database that wouldn't let you specify them, right?



yanboliang and others added 5 commits November 27, 2015 11:48
Change ```cumeDist -> cume_dist, denseRank -> dense_rank, percentRank -> percent_rank, rowNumber -> row_number``` on the SparkR side.
There are two reasons that we should make this change:
* We should follow the [naming convention rule of R](http://www.inside-r.org/node/230645)
* Spark DataFrame has deprecated the old convention (such as ```cumeDist```) and will remove it in Spark 2.0.

It's better to fix this issue before the 1.6 release; otherwise we will make a breaking API change.
cc shivaram sun-rui

Author: Yanbo Liang <[email protected]>

Closes #10016 from yanboliang/SPARK-12025.
…ingListenerSuite

In StreamingListenerSuite."don't call ssc.stop in listener", after the main thread calls `ssc.stop()`,  `StreamingContextStoppingCollector` may call  `ssc.stop()` in the listener bus thread, which is a dead-lock. This PR updated `StreamingContextStoppingCollector` to only call `ssc.stop()` in the first batch to avoid the dead-lock.

Author: Shixiong Zhu <[email protected]>

Closes #10011 from zsxwing/fix-test-deadlock.
…the value is null literals

When calling `get_json_object` for the following two cases, both results are `"null"`:

```scala
    val tuple: Seq[(String, String)] = ("5", """{"f1": null}""") :: Nil
    val df: DataFrame = tuple.toDF("key", "jstring")
    val res = df.select(functions.get_json_object($"jstring", "$.f1")).collect()
```
```scala
    val tuple2: Seq[(String, String)] = ("5", """{"f1": "null"}""") :: Nil
    val df2: DataFrame = tuple2.toDF("key", "jstring")
    val res3 = df2.select(functions.get_json_object($"jstring", "$.f1")).collect()
```

Fixed the problem and also added a test case.

Author: gatorsmile <[email protected]>

Closes #10018 from gatorsmile/get_json_object.
srowen commented Nov 28, 2015

It worked fine for me locally since I have sbt available locally. I tried it on another machine so that it has to download sbt. Although it seemed to take some time, it succeeded. It looks maybe like a transient problem accessing the repos? I didn't seem to connect to bintray for some reason, but indeed those URLs look like they don't work. But they're not in our build. Hm. Try again?

(This will need a rebase/force push BTW)

felixcheung and others added 3 commits November 28, 2015 21:02
…, tests, fix doc and add examples

shivaram sun-rui

Author: felixcheung <[email protected]>

Closes #10019 from felixcheung/rfunctionsdoc.
Add support for colnames, colnames<-, coltypes<-
Also added tests for names and names<-, which had no tests previously.

I merged with PR 8984 (coltypes). Clicked the wrong thing, screwed up the PR. Recreated it here. Was #9218

shivaram sun-rui

Author: felixcheung <[email protected]>

Closes #9654 from felixcheung/colnamescoltypes.
hvanhovell and others added 11 commits November 29, 2015 14:13
In #9409 we enabled multi-column counting. The approach taken in that PR introduces a bit of overhead by first creating a row only to check if all of the columns are non-null.

This PR fixes that technical debt. Count now takes multiple columns as its input. In order to make this work I have also added support for multiple columns in the single distinct code path.

cc yhuai

Author: Herman van Hovell <[email protected]>

Closes #10015 from hvanhovell/SPARK-12024.
… Parquet relation with decimal column".

https://issues.apache.org/jira/browse/SPARK-12039

Since it is pretty flaky in hadoop 1 tests, we can disable it while we are investigating the cause.

Author: Yin Huai <[email protected]>

Closes #10035 from yhuai/SPARK-12039-ignore.
…form zk://host:port for a multi-master Mesos cluster using ZooKeeper

* According to below doc and validation logic in [SparkSubmit.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L231), master URL for a mesos cluster should always start with `mesos://`

http://spark.apache.org/docs/latest/running-on-mesos.html
`The Master URLs for Mesos are in the form mesos://host:5050 for a single-master Mesos cluster, or mesos://zk://host:2181 for a multi-master Mesos cluster using ZooKeeper.`

* However, [SparkContext.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2749) does not enforce this and can receive a master URL in the form `zk://host:port`

* For master URLs in the form `zk://host:port`, the valid form should be `mesos://zk://host:port`

* This PR restricts the validation in `SparkContext.scala`, so that only Mesos master URLs prefixed with `mesos://` are accepted.

* This PR also updates the corresponding unit test.

Author: toddwan <[email protected]>

Closes #9886 from toddwan/S11859.
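An illustrative check matching the stricter validation described above (the function name is made up for this sketch):

```scala
// Only URLs prefixed with "mesos://" are treated as Mesos masters.
def isMesosMasterUrl(url: String): Boolean =
  url.startsWith("mesos://")

isMesosMasterUrl("mesos://zk://host:2181") // true  - multi-master via ZooKeeper
isMesosMasterUrl("mesos://host:5050")      // true  - single master
isMesosMasterUrl("zk://host:2181")         // false - now rejected by SparkContext
```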
…here to support SBT pom reader only.

Author: Prashant Sharma <[email protected]>

Closes #10012 from ScrapCodes/minor-build-comment.
Top is implemented in terms of takeOrdered, which already maintains the
order, so top should, too.

Author: Wieland Hoffmann <[email protected]>

Closes #10013 from mineo/top-order.
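A quick spark-shell style illustration of the ordering guarantee described above (`sc` is an existing SparkContext):

```scala
val rdd = sc.parallelize(Seq(3, 7, 1, 9, 5))

rdd.takeOrdered(3) // Array(1, 3, 5) - ascending
rdd.top(3)         // Array(9, 7, 5) - descending; order is maintained
```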
…ing database supports transactions

Fixes [SPARK-11989](https://issues.apache.org/jira/browse/SPARK-11989)

Author: CK50 <[email protected]>
Author: Christian Kurz <[email protected]>

Closes #9973 from CK50/branch-1.6_non-transactional.

(cherry picked from commit a589736)
Signed-off-by: Reynold Xin <[email protected]>
…to master-SPARK-12010

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
	sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
CK50 commented Nov 30, 2015

I have tried to rebase this PR onto the latest apache/spark. - Does this look as expected?

srowen commented Nov 30, 2015

No, this branch has a lot of unrelated commits now. Generally you keep your fork's master up to date with the main one, and then bring yours up to date, then switch to your branch and git rebase master.

You can hard-reset back over the bad commits or just start over.

srowen commented Dec 1, 2015

@CK50 do you mind closing this PR? Maybe easier than rebasing and untangling it.

CK50 commented Dec 1, 2015

Need to create a new pull request for this fix.
