[SPARK-12010][SQL] Spark JDBC requires support for column-name-free INSERT syntax #10003
Conversation
Test build #2119 has finished for PR 10003 at commit
what does progress mean here?
Hi Reynold,
Progress is a software vendor. They are/will be offering a JDBC driver for Cassandra.
Thanks,
Christian
On 26.11.2015 at 20:10, Reynold Xin [email protected] wrote:
In sql/core/src/main/scala/org/apache/spark/sql/jdbc/ProgressCassandraDialect.scala:
```diff
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
+
+package org.apache.spark.sql.jdbc
+
+import java.sql.Types
+
+import org.apache.spark.sql.types._
+
+
+private case object ProgressCassandraDialect extends JdbcDialect {
```
what does progress mean here?
Are there other Cassandra jdbc drivers and is this the primary one people use? I'm thinking we should just drop "Progress" here if this is the primary driver people use.
Yes, there are a bunch of other Cassandra JDBC drivers out there. Progress is one of them. But given that the Cassandra SQL syntax mandates the list of column names in the INSERT statement, it is likely that other drivers will require the same syntax. So I will remove "Progress" from the class name.
@CK50 this needs some style cleanup - you can run dev/lint-scala to check styles locally. Can you please also attach the generated query before / after this change? IIUC, maybe it's ok to have this for all dialects.
…of aliases and real columns This is based on #9844, with some bug fixes and cleanup. The problem is that a normal operator should be resolved based on its child, but the `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences` (if the grandchild is `Project`) and `ResolveAggregateFunctions` (if the grandchild is `Aggregate`). For example, for `select c1 as a , c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. First `a` will be resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we will try to resolve both `a` and `c2` based on its grandchild, but fail because `a` is not a legal aggregate expression. Whoever merges this PR, please give the credit to dilipbiswal Author: Dilip Biswal <[email protected]> Author: Wenchen Fan <[email protected]> Closes #9961 from cloud-fan/sort.
…from maven, we need to try to download the version that is used by Spark If we need to download Hive/Hadoop artifacts, try to download a Hadoop that matches the Hadoop used by Spark. If the Hadoop artifact cannot be resolved (e.g. Hadoop version is a vendor specific version like 2.0.0-cdh4.1.1), we will use Hadoop 2.4.0 (we used to hard code this version as the hadoop that we will download from maven) and we will not share Hadoop classes. I tested this match in my laptop with the following confs (these confs are used by our builds). All tests are good. ``` build/sbt -Phadoop-1 -Dhadoop.version=1.2.1 -Pkinesis-asl -Phive-thriftserver -Phive build/sbt -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Pkinesis-asl -Phive-thriftserver -Phive build/sbt -Pyarn -Phadoop-2.2 -Pkinesis-asl -Phive-thriftserver -Phive build/sbt -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl -Phive-thriftserver -Phive ``` Author: Yin Huai <[email protected]> Closes #9979 from yhuai/versionsSuite.
This is a followup for #9959. I added more documentation and rewrote some monadic code into simpler ifs. Author: Reynold Xin <[email protected]> Closes #9995 from rxin/SPARK-11973.
Author: muxator <[email protected]> Closes #10008 from muxator/patch-1.
In the previous implementation, the driver needs to know the executor listening address to send the thread dump request. However, in Netty RPC, the executor doesn't listen to any port, so the executor thread dump feature is broken. This patch makes the driver use the endpointRef stored in BlockManagerMasterEndpoint to send the thread dump request to fix it. Author: Shixiong Zhu <[email protected]> Closes #9976 from zsxwing/executor-thread-dump.
…rguments Spark SQL aggregate functions: ```Java stddev stddev_pop stddev_samp variance var_pop var_samp skewness kurtosis collect_list collect_set ``` should support ```columnName``` as arguments like other aggregate functions (max/min/count/sum). Author: Yanbo Liang <[email protected]> Closes #9994 from yanboliang/SPARK-12011.
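A quick usage sketch of what the columnName overloads enable (assumes a Spark 1.6-style spark-shell with an existing `sqlContext`; the DataFrame and column name are made up for illustration):

```scala
import org.apache.spark.sql.functions._
import sqlContext.implicits._   // assumes an existing SQLContext named sqlContext

val df = Seq(1.0, 2.0, 3.0, 4.0).toDF("value")

// Column-based form, which already existed:
df.agg(stddev($"value"), var_samp($"value")).show()

// String column-name form, analogous to max("value") / min("value"):
df.agg(stddev("value"), var_samp("value")).show()
```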
Reference: https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor In order for PostgreSQL to honor a non-zero fetchSize setting, its Connection.autoCommit needs to be set to false; otherwise it will just quietly ignore the fetchSize setting. This adds a new side-effecting, dialect-specific beforeFetch method that will fire before a select query is run. Author: mariusvniekerk <[email protected]> Closes #9861 from mariusvniekerk/SPARK-11881.
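A rough sketch of how such a dialect-level hook can be used (the `beforeFetch` name and the `fetchsize` property key follow the description above; treat the exact signature as an assumption rather than the verbatim merged code):

```scala
import java.sql.Connection

import org.apache.spark.sql.jdbc.JdbcDialect

// Sketch: disable autoCommit before fetching so the PostgreSQL driver honours
// a non-zero fetch size instead of silently ignoring it.
private case object PostgresCursorDialect extends JdbcDialect {

  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  override def beforeFetch(connection: Connection, properties: Map[String, String]): Unit = {
    val fetchSize = properties.getOrElse("fetchsize", "0").toInt
    if (fetchSize > 0) {
      connection.setAutoCommit(false)  // cursor-based fetching requires autoCommit = false
    }
  }
}
```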
Author: Jeff Zhang <[email protected]> Closes #9903 from zjffdu/SPARK-11917.
Fix regression test for SPARK-11778. marmbrus Could you please take a look? Thank you very much!! Author: Huaxin Gao <[email protected]> Closes #9890 from huaxingao/spark-11778-regression-test.
If `--private-ips` is required but not provided, spark_ec2.py may behave inappropriately, including attempting to ssh to localhost in attempts to verify ssh connectivity to the cluster. This fixes that behavior by raising a `UsageError` exception if `get_dns_name` is unable to determine a hostname as a result. Author: Jeremy Derr <[email protected]> Closes #9975 from jcderr/SPARK-11991/ec_spark.py_hostname_check.
…ned by long column Check for partition column null-ability while building the partition spec. Author: Dilip Biswal <[email protected]> Closes #10001 from dilipbiswal/spark-11997.
This is an orthogonal issue, but it occurs to me that we have a bunch of dialects, and a bunch of behavior variations, but typically a variation isn't specific to a dialect, necessarily. How do we let a new variation apply to N dialects at once? This may become a problem soon and need some refactoring.
@rxin The old generated statement is INSERT INTO mytable VALUES (?, ?, ..., ?) whereas the new statement is INSERT INTO mytable (col1, col2, ..., colN) VALUES (?, ?, ..., ?). So the old syntax relies on the column positions, whereas with the new one the column names are taken from the DataFrame, so the DataFrame column names become significant. At least for Oracle adding the column names is fine. WDYT?
I'm not sure what the issue is with SBT (do you get download errors or something?) but you can also see the results of the style checker in the build output: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2119/consoleFull I tend to think it's better practice to specify column names explicitly. Is there a downside? I don't recall a database that wouldn't let you specify them, right?
Do I take this as "SBT works fine for me on master"? Yes, build/sbt fails for me when doing downloads: Getting org.scala-sbt sbt 0.13.7 ... :: problems summary :: :::: ERRORS. I am behind a proxy and have tried configuring the Linux env vars. The thing which strikes me is what happens when I try accessing these URLs. Back to the PR: in general I agree that column names should be given by default. I just have some concerns, but I am fine with making column names the new default. Please let me know how to proceed.
Change ```cumeDist -> cume_dist, denseRank -> dense_rank, percentRank -> percent_rank, rowNumber -> row_number``` at SparkR side. There are two reasons that we should make this change: * We should follow the [naming convention rule of R](http://www.inside-r.org/node/230645) * Spark DataFrame has deprecated the old convention (such as ```cumeDist```) and will remove it in Spark 2.0. It's better to fix this issue before 1.6 release, otherwise we will make breaking API change. cc shivaram sun-rui Author: Yanbo Liang <[email protected]> Closes #10016 from yanboliang/SPARK-12025.
…ingListenerSuite In StreamingListenerSuite."don't call ssc.stop in listener", after the main thread calls `ssc.stop()`, `StreamingContextStoppingCollector` may call `ssc.stop()` in the listener bus thread, which is a dead-lock. This PR updated `StreamingContextStoppingCollector` to only call `ssc.stop()` in the first batch to avoid the dead-lock. Author: Shixiong Zhu <[email protected]> Closes #10011 from zsxwing/fix-test-deadlock.
… 2.0 test https://issues.apache.org/jira/browse/SPARK-12020 Author: Yin Huai <[email protected]> Closes #10010 from yhuai/SPARK-12020.
…the value is null literals
When calling `get_json_object` for the following two cases, both results are `"null"`:
```scala
val tuple: Seq[(String, String)] = ("5", """{"f1": null}""") :: Nil
val df: DataFrame = tuple.toDF("key", "jstring")
val res = df.select(functions.get_json_object($"jstring", "$.f1")).collect()
```
```scala
val tuple2: Seq[(String, String)] = ("5", """{"f1": "null"}""") :: Nil
val df2: DataFrame = tuple2.toDF("key", "jstring")
val res3 = df2.select(functions.get_json_object($"jstring", "$.f1")).collect()
```
Fixed the problem and also added a test case.
Author: gatorsmile <[email protected]>
Closes #10018 from gatorsmile/get_json_object.
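For reference, a small sketch of the behaviour the fix aims for, based on the description above (the expected output is my reading of the change, not copied from the test case; assumes spark-shell with a `sqlContext`):

```scala
import org.apache.spark.sql.functions
import sqlContext.implicits._

val withNullLiteral = Seq(("5", """{"f1": null}""")).toDF("key", "jstring")
val withNullString  = Seq(("5", """{"f1": "null"}""")).toDF("key", "jstring")

// A JSON null value should come back as SQL NULL ...
withNullLiteral.select(functions.get_json_object($"jstring", "$.f1")).show()

// ... while the string "null" should come back as the literal string "null".
withNullString.select(functions.get_json_object($"jstring", "$.f1")).show()
```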
It worked fine for me locally since I have (This will need a rebase/force push BTW)
…, tests, fix doc and add examples shivaram sun-rui Author: felixcheung <[email protected]> Closes #10019 from felixcheung/rfunctionsdoc.
Add support for colnames, colnames<-, coltypes<- Also added tests for names, names<- which had no tests previously. I merged with PR 8984 (coltypes). Clicked the wrong thing, screwed up the PR. Recreated it here. Was #9218 shivaram sun-rui Author: felixcheung <[email protected]> Closes #9654 from felixcheung/colnamescoltypes.
Author: Sun Rui <[email protected]> Closes #9769 from sun-rui/SPARK-11781.
In #9409 we enabled multi-column counting. The approach taken in that PR introduces a bit of overhead by first creating a row only to check if all of the columns are non-null. This PR fixes that technical debt. Count now takes multiple columns as its input. In order to make this work I have also added support for multiple columns in the single distinct code path. cc yhuai Author: Herman van Hovell <[email protected]> Closes #10015 from hvanhovell/SPARK-12024.
… Parquet relation with decimal column". https://issues.apache.org/jira/browse/SPARK-12039 Since it is pretty flaky in hadoop 1 tests, we can disable it while we are investigating the cause. Author: Yin Huai <[email protected]> Closes #10035 from yhuai/SPARK-12039-ignore.
…form zk://host:port for a multi-master Mesos cluster using ZooKeeper * According to the doc below and the validation logic in [SparkSubmit.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L231), the master URL for a Mesos cluster should always start with `mesos://` http://spark.apache.org/docs/latest/running-on-mesos.html `The Master URLs for Mesos are in the form mesos://host:5050 for a single-master Mesos cluster, or mesos://zk://host:2181 for a multi-master Mesos cluster using ZooKeeper.` * However, [SparkContext.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2749) fails the validation and can receive a master URL in the form `zk://host:port` * For master URLs in the form `zk://host:port`, the valid form should be `mesos://zk://host:port` * This PR restricts the validation in `SparkContext.scala`, so that only Mesos master URLs prefixed with `mesos://` are accepted. * This PR also updates the corresponding unit test. Author: toddwan <[email protected]> Closes #9886 from toddwan/S11859.
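For illustration, a sketch of the accepted vs. rejected forms (hosts and ports are placeholders):

```scala
import org.apache.spark.SparkConf

// Valid: a multi-master Mesos cluster via ZooKeeper must carry the mesos:// prefix.
val conf = new SparkConf()
  .setAppName("example")
  .setMaster("mesos://zk://host1:2181,host2:2181/mesos")

// After this change, a bare zk:// master URL such as "zk://host1:2181" is
// rejected when the SparkContext is created, instead of being silently accepted.
```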
…here to support SBT pom reader only. Author: Prashant Sharma <[email protected]> Closes #10012 from ScrapCodes/minor-build-comment.
Top is implemented in terms of takeOrdered, which already maintains the order, so top should, too. Author: Wieland Hoffmann <[email protected]> Closes #10013 from mineo/top-order.
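A tiny illustration of the ordering guarantee in question (spark-shell sketch; `sc` is the usual SparkContext):

```scala
val rdd = sc.parallelize(Seq(3, 7, 1, 5))

rdd.top(2)          // Array(7, 5)  -- largest elements, in descending order
rdd.takeOrdered(2)  // Array(1, 3)  -- smallest elements, in ascending order
```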
this is a trivial fix, discussed [here](http://stackoverflow.com/questions/28500401/maven-assembly-plugin-warning-the-assembly-descriptor-contains-a-filesystem-roo/). Author: Prashant Sharma <[email protected]> Closes #10014 from ScrapCodes/assembly-warning.
…ing database supports transactions Fixes [SPARK-11989](https://issues.apache.org/jira/browse/SPARK-11989) Author: CK50 <[email protected]> Author: Christian Kurz <[email protected]> Closes #9973 from CK50/branch-1.6_non-transactional. (cherry picked from commit a589736) Signed-off-by: Reynold Xin <[email protected]>
…to master-SPARK-12010 Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
I have tried to rebase this PR onto the latest apache/spark. Does this look as expected?
No, this branch has a lot of unrelated commits now. Generally you keep your fork's master in sync with upstream and develop on a separate branch. You can hard-reset back over the bad commits or just start over.
@CK50 do you mind closing this PR? Maybe easier than rebasing and untangling it.
Need to create a new pull request for this fix.
This PR moves the generation of the SQL INSERT statement into the JdbcDialect and allows dialects to add column names based on the RDD column names. The default behaviour stays unchanged; only the new ProgressCassandraDialect inserts column names into the SQL INSERT statement.
I was wondering whether the logic for inserting the column names should be placed in some kind of generic Dialect rather than in the very specific ProgressCassandraDialect. Any suggestions welcome.
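A minimal sketch of the mechanism described above, assuming a helper that a dialect can drive (names and wiring are illustrative, not the exact code in this PR):

```scala
import org.apache.spark.sql.types.StructType

// Sketch only: a dialect-controlled flag decides whether the generated INSERT
// lists the DataFrame column names or falls back to the positional form.
object InsertStatementSketch {
  def insertStatement(table: String, schema: StructType, withColumnNames: Boolean): String = {
    val placeholders = schema.fields.map(_ => "?").mkString(", ")
    if (withColumnNames) {
      // e.g. INSERT INTO mytable (col1, col2) VALUES (?, ?)
      s"INSERT INTO $table (${schema.fieldNames.mkString(", ")}) VALUES ($placeholders)"
    } else {
      // Default behaviour, unchanged: INSERT INTO mytable VALUES (?, ?)
      s"INSERT INTO $table VALUES ($placeholders)"
    }
  }
}
```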