
Conversation

@CK50 CK50 commented Nov 26, 2015

This PR moves the generation of the SQL INSERT statement into the JdbcDialect and allows dialects to add column names based on the RDD column names. The default behaviour stays unchanged; only the new ProgressCassandraDialect adds column names to the generated INSERT statement.

I was wondering whether the logic for inserting the column names should be placed in some kind of generic Dialect instead of into the very specific ProgressCassandraDialect. Any suggestions welcome.
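For readers following along, here is a minimal sketch of the kind of dialect hook being described. The class name `CassandraDialect` and the method name `insertStatement` are illustrative only, not the exact API introduced by this patch:

```scala
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.StructType

// Illustrative dialect that contributes explicit column names to the
// generated INSERT statement (CQL requires them).
case object CassandraDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:cassandra")

  // Hypothetical hook: build the INSERT from the table name and the
  // schema instead of relying on column positions.
  def insertStatement(table: String, schema: StructType): String = {
    val columns = schema.fields.map(_.name).mkString(", ")
    val placeholders = schema.fields.map(_ => "?").mkString(", ")
    s"INSERT INTO $table ($columns) VALUES ($placeholders)"
  }
}
```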

SparkQA commented Nov 26, 2015

Test build #2119 has finished for PR 10003 at commit f2bf6ee.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Reynold Xin (Contributor) commented on the diff:

what does progress mean here?

CK50 (Contributor, Author) replied:

Hi Reynold,
Progress is a software vendor. They are/will be offering a JDBC driver for Cassandra.
Thanks,
Christian

On 26.11.2015 at 20:10, Reynold Xin [email protected] wrote:

In sql/core/src/main/scala/org/apache/spark/sql/jdbc/ProgressCassandraDialect.scala:

+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.jdbc
+
+import java.sql.Types
+
+import org.apache.spark.sql.types._
+
+
+private case object ProgressCassandraDialect extends JdbcDialect {

what does progress mean here?



Contributor commented:

Are there other Cassandra jdbc drivers and is this the primary one people use? I'm thinking we should just drop "Progress" here if this is the primary driver people use.

CK50 (Contributor, Author) replied:

Yes, there are a bunch of other Cassandra JDBC drivers out there; Progress is one of them. But given that the Cassandra SQL syntax mandates the list of column names in the INSERT statement, it is likely that other drivers will require the same syntax. So I will remove "Progress" from the class name.

rxin commented Nov 26, 2015

@CK50 this needs some style cleanup - you can run dev/lint-scala to check styles locally.

Can you please also attach the generated query before / after this change? IIUC, maybe it's ok to have this for all dialects.

dilipbiswal and others added 12 commits November 26, 2015 11:31
…of aliases and real columns

This is based on #9844, with some bug fixes and cleanup.

The problem is that a normal operator should be resolved based on its child, but the `Sort` operator can also be resolved based on its grandchild. So we have 3 rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences` (if the grandchild is `Project`) and `ResolveAggregateFunctions` (if the grandchild is `Aggregate`).
For example, for `select c1 as a, c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. First `a` will be resolved in `ResolveReferences` based on its child, and when we reach `ResolveAggregateFunctions`, we will try to resolve both `a` and `c2` based on its grandchild, but fail because `a` is not a legal aggregate expression.

Whoever merges this PR, please give the credit to dilipbiswal.

Author: Dilip Biswal <[email protected]>
Author: Wenchen Fan <[email protected]>

Closes #9961 from cloud-fan/sort.
…from maven, we need to try to download the version that is used by Spark

If we need to download Hive/Hadoop artifacts, try to download a Hadoop version that matches the Hadoop used by Spark. If the Hadoop artifact cannot be resolved (e.g. the Hadoop version is a vendor-specific version like 2.0.0-cdh4.1.1), we will use Hadoop 2.4.0 (we used to hard-code this version as the Hadoop that we download from Maven) and we will not share Hadoop classes.

I tested this matching on my laptop with the following confs (these confs are used by our builds). All tests are good.
```
build/sbt -Phadoop-1 -Dhadoop.version=1.2.1 -Pkinesis-asl -Phive-thriftserver -Phive
build/sbt -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Pkinesis-asl -Phive-thriftserver -Phive
build/sbt -Pyarn -Phadoop-2.2 -Pkinesis-asl -Phive-thriftserver -Phive
build/sbt -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl -Phive-thriftserver -Phive
```

Author: Yin Huai <[email protected]>

Closes #9979 from yhuai/versionsSuite.
This is a followup for #9959.

I added more documentation and rewrote some monadic code into simpler ifs.

Author: Reynold Xin <[email protected]>

Closes #9995 from rxin/SPARK-11973.
Author: muxator <[email protected]>

Closes #10008 from muxator/patch-1.
In the previous implementation, the driver needs to know the executor listening address to send the thread dump request. However, in Netty RPC, the executor doesn't listen to any port, so the executor thread dump feature is broken.

This patch makes the driver use the endpointRef stored in BlockManagerMasterEndpoint to send the thread dump request to fix it.

Author: Shixiong Zhu <[email protected]>

Closes #9976 from zsxwing/executor-thread-dump.
…rguments

Spark SQL aggregate function:
```Java
stddev
stddev_pop
stddev_samp
variance
var_pop
var_samp
skewness
kurtosis
collect_list
collect_set
```
should support ```columnName``` as arguments like other aggregate functions (max/min/count/sum); a brief usage sketch follows this note.

Author: Yanbo Liang <[email protected]>

Closes #9994 from yanboliang/SPARK-12011.
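A brief usage sketch of the column-name overloads described in the note above; `df` and its "value" column are assumed to exist for illustration:

```scala
import org.apache.spark.sql.functions._

// With this change, the aggregates accept a column name (String),
// just like max/min/count/sum already do.
val summary = df.agg(
  stddev("value"),
  var_pop("value"),
  skewness("value"),
  kurtosis("value"))
summary.show()
```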
Reference: https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor
In order for PostgreSQL to honor the fetchSize non-zero setting, its Connection.autoCommit needs to be set to false. Otherwise, it will just quietly ignore the fetchSize setting.

This adds a new side-effecting, dialect-specific beforeFetch method that fires before a select query is run (a plain-JDBC sketch of this behaviour follows this note).

Author: mariusvniekerk <[email protected]>

Closes #9861 from mariusvniekerk/SPARK-11881.
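A plain-JDBC sketch of the PostgreSQL behaviour described above (the connection URL, credentials, and table name are placeholders); the Spark change wires the equivalent of the autoCommit step into a dialect-level hook that runs before the fetch:

```scala
import java.sql.DriverManager

// Placeholders for illustration only.
val conn = DriverManager.getConnection("jdbc:postgresql://localhost/db", "user", "pass")
try {
  // Without this, PostgreSQL silently ignores setFetchSize and
  // materializes the whole result set on the client.
  conn.setAutoCommit(false)
  val stmt = conn.createStatement()
  stmt.setFetchSize(1000) // stream rows in batches of 1000
  val rs = stmt.executeQuery("SELECT * FROM big_table")
  while (rs.next()) {
    // process the current row
  }
} finally {
  conn.close()
}
```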
Fix regression test for SPARK-11778.
 marmbrus
Could you please take a look?
Thank you very much!!

Author: Huaxin Gao <[email protected]>

Closes #9890 from huaxingao/spark-11778-regression-test.
If `--private-ips` is required but not provided, spark_ec2.py may behave inappropriately, including attempting to ssh to localhost in attempts to verify ssh connectivity to the cluster.

This fixes that behavior by raising a `UsageError` exception if `get_dns_name` is unable to determine a hostname as a result.

Author: Jeremy Derr <[email protected]>

Closes #9975 from jcderr/SPARK-11991/ec_spark.py_hostname_check.
…ned by long column

Check for partition column null-ability while building the partition spec.

Author: Dilip Biswal <[email protected]>

Closes #10001 from dilipbiswal/spark-11997.
srowen commented Nov 27, 2015

This is an orthogonal issue, but it occurs to me that we have a bunch of dialects and a bunch of behavior variations, yet a variation typically isn't specific to a single dialect. How do we let a new variation apply to N dialects at once? This may become a problem soon and need some refactoring.

CK50 commented Nov 27, 2015

@rxin
I wish I could run dev/lint-scala, but even after hours of struggling I cannot get build/sbt running. I have downloaded sbt-launcher.jar manually, but now I am stuck with sbt trying to download extra files from a variety of invalid URLs such as
https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.pom
https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.jar
Any ideas appreciated. The Spark 1.6 build page did not help. - Any other page I should follow?

The old generated statement is

INSERT INTO mytable VALUES (?, ?, ..., ?)

whereas the new statement is

INSERT INTO mytable (col1, col2, ..., colN) VALUES (?, ?, ..., ?)

So the old syntax relies on the column positions, whereas the new one relies on the column names.

Column names are taken from the DataFrame, so DataFrame column names must match the target table column names.

At least for Oracle, adding the column names is fine. The benefit of this change is that you can write out DataFrames with fewer columns than the target table, which I think is not possible today. The downside is that DataFrame column names must match. For best backwards compatibility I only wanted to provide column names when they are really needed, as in the case of Cassandra.
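To make the trade-off concrete, here is a hypothetical usage sketch of the benefit mentioned above (the URL, table name, and `connectionProperties` are placeholders; `df` is assumed to have columns col1 and col3 that match the target table):

```scala
import java.util.Properties

val connectionProperties = new Properties() // driver, user, password, etc.

// The target table has more columns than the DataFrame. With explicit
// column names the generated statement becomes
//   INSERT INTO mytable (col1, col3) VALUES (?, ?)
// whereas the positional form would require all columns in order.
df.select("col1", "col3")
  .write
  .mode("append")
  .jdbc("jdbc:oracle:thin:@//dbhost:1521/service", "mytable", connectionProperties)
```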

WDYT?

On 26.11.2015 20:11, Reynold Xin wrote:

@CK50 https://github.com/CK50 this needs some style cleanup - you can run dev/lint-scala to check styles locally.

Can you please also attach the generated query before / after this change? IIUC, maybe it's ok to have this for all dialects.



srowen commented Nov 27, 2015

I'm not sure what the issue is with SBT (do you get download errors or something?) but you can also see the results of the style checker in the build output: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2119/consoleFull
Search for "[error]"

I tend to think it's better practice to specify column names explicitly. Is there a downside? I don't recall a database that wouldn't let you specify them, right?

CK50 commented Nov 27, 2015

Do I take this as "SBT works fine for me on master"?

Yes, build/sbt fails for me when doing downloads:

Getting org.scala-sbt sbt 0.13.7 ...

:: problems summary ::
:::: WARNINGS
module not found: org.scala-sbt#sbt;0.13.7
==== local: tried
/home/ckurz/.ivy2/local/org.scala-sbt/sbt/0.13.7/ivys/ivy.xml
-- artifact org.scala-sbt#sbt;0.13.7!sbt.jar:
/home/ckurz/.ivy2/local/org.scala-sbt/sbt/0.13.7/jars/sbt.jar
==== jcenter: tried
https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.pom
-- artifact org.scala-sbt#sbt;0.13.7!sbt.jar:
https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.jar
==== typesafe-ivy-releases: tried
https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt/0.13.7/ivys/ivy.xml
==== Maven Central: tried
https://repo1.maven.org/maven2/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.pom
-- artifact org.scala-sbt#sbt;0.13.7!sbt.jar:
https://repo1.maven.org/maven2/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.scala-sbt#sbt;0.13.7: not found
::::::::::::::::::::::::::::::::::::::::::::::

:::: ERRORS
Server access Error: Connection timed out
url=https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.pom
Server access Error: Connection timed out
url=https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.jar
Server access Error: Connection timed out
url=https://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt/0.13.7/ivys/ivy.xml
Server access Error: Connection timed out
url=https://repo1.maven.org/maven2/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.pom
Server access Error: Connection timed out
url=https://repo1.maven.org/maven2/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.jar

I am behind a proxy and have tried configuring Linux env vars and JAVA_OPTS/SBT_OPTS, but this did not help.

What strikes me is that when I try accessing these URLs, e.g. https://jcenter.bintray.com/org/scala-sbt/sbt/0.13.7/sbt-0.13.7.pom, from within a browser, I also get a 404. When looking into https://jcenter.bintray.com/org/scala-sbt there is no sub-folder called sbt. - No idea what is going wrong here :-(


Back to the PR:
I will upload the suggested style changes.

In general I agree that column names should be given by default. I just have a few concerns about making this the new default behaviour:

  1. Does this work on all supported DBs? - Does Spark have automated tests with appropriate envs for this?
  2. Is it okay to assume that DataFrame column names always match target column names? (Spark currently does not rely on this. Instead it relies on column positions, which is usually not good practice.)
  3. Is such a change okay with Spark's rules for backwards compatibility?

Just some concerns; I am fine with making column names the new default behaviour. But as I said, I do not have the background to judge this.

Please let me know how to proceed.

On 27.11.2015 19:49, Sean Owen wrote:

I'm not sure what the issue is with SBT (do you get download errors or something?) but you can also see the results of the style checker in the build output:
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2119/consoleFull
Search for "[error]"

I tend to think it's better practice to specify column names explicitly. Is there a downside? I don't recall a database that wouldn't let you specify them, right?



yanboliang and others added 5 commits November 27, 2015 11:48
Change ```cumeDist -> cume_dist, denseRank -> dense_rank, percentRank -> percent_rank, rowNumber -> row_number``` on the SparkR side.
There are two reasons that we should make this change:
* We should follow the [naming convention rule of R](http://www.inside-r.org/node/230645)
* Spark DataFrame has deprecated the old convention (such as ```cumeDist```) and will remove it in Spark 2.0.

It's better to fix this issue before the 1.6 release; otherwise we will make a breaking API change.
cc shivaram sun-rui

Author: Yanbo Liang <[email protected]>

Closes #10016 from yanboliang/SPARK-12025.
…ingListenerSuite

In StreamingListenerSuite."don't call ssc.stop in listener", after the main thread calls `ssc.stop()`,  `StreamingContextStoppingCollector` may call  `ssc.stop()` in the listener bus thread, which is a dead-lock. This PR updated `StreamingContextStoppingCollector` to only call `ssc.stop()` in the first batch to avoid the dead-lock.

Author: Shixiong Zhu <[email protected]>

Closes #10011 from zsxwing/fix-test-deadlock.
…the value is null literals

When calling `get_json_object` for the following two cases, both results are `"null"`:

```scala
    val tuple: Seq[(String, String)] = ("5", """{"f1": null}""") :: Nil
    val df: DataFrame = tuple.toDF("key", "jstring")
    val res = df.select(functions.get_json_object($"jstring", "$.f1")).collect()
```
```scala
    val tuple2: Seq[(String, String)] = ("5", """{"f1": "null"}""") :: Nil
    val df2: DataFrame = tuple2.toDF("key", "jstring")
    val res3 = df2.select(functions.get_json_object($"jstring", "$.f1")).collect()
```

Fixed the problem and also added a test case.

Author: gatorsmile <[email protected]>

Closes #10018 from gatorsmile/get_json_object.
srowen commented Nov 28, 2015

It worked fine for me locally since I have sbt available locally. I tried it on another machine so that it has to download sbt. Although it seemed to take some time, it succeeded. It looks maybe like a transient problem accessing the repos? I didn't seem to connect to bintray for some reason, but indeed those URLs look like they don't work. But they're not in our build. Hm. Try again?

(This will need a rebase/force push BTW)

felixcheung and others added 3 commits November 28, 2015 21:02
…, tests, fix doc and add examples

shivaram sun-rui

Author: felixcheung <[email protected]>

Closes #10019 from felixcheung/rfunctionsdoc.
Add support for colnames, colnames<-, coltypes<-
Also added tests for names and names<-, which had no tests previously.

I merged with PR 8984 (coltypes). Clicked the wrong thing, screwed up the PR. Recreated it here. Was #9218

shivaram sun-rui

Author: felixcheung <[email protected]>

Closes #9654 from felixcheung/colnamescoltypes.
hvanhovell and others added 11 commits November 29, 2015 14:13
In #9409 we enabled multi-column counting. The approach taken in that PR introduces a bit of overhead by first creating a row only to check if all of the columns are non-null.

This PR fixes that technical debt. Count now takes multiple columns as its input. In order to make this work I have also added support for multiple columns in the single distinct code path.

cc yhuai

Author: Herman van Hovell <[email protected]>

Closes #10015 from hvanhovell/SPARK-12024.
… Parquet relation with decimal column".

https://issues.apache.org/jira/browse/SPARK-12039

Since it is pretty flaky in hadoop 1 tests, we can disable it while we are investigating the cause.

Author: Yin Huai <[email protected]>

Closes #10035 from yhuai/SPARK-12039-ignore.
…form zk://host:port for a multi-master Mesos cluster using ZooKeeper

* According to below doc and validation logic in [SparkSubmit.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L231), master URL for a mesos cluster should always start with `mesos://`

http://spark.apache.org/docs/latest/running-on-mesos.html
`The Master URLs for Mesos are in the form mesos://host:5050 for a single-master Mesos cluster, or mesos://zk://host:2181 for a multi-master Mesos cluster using ZooKeeper.`

* However, [SparkContext.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2749) does not enforce this and can receive a master URL in the form `zk://host:port`

* For master URLs in the form `zk://host:port`, the valid form should be `mesos://zk://host:port`

* This PR restricts the validation in `SparkContext.scala`, so that only Mesos master URLs prefixed with `mesos://` are accepted.

* This PR also updates the corresponding unit test.

Author: toddwan <[email protected]>

Closes #9886 from toddwan/S11859.
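An illustrative check matching the stricter validation described above (the function name is made up for this sketch):

```scala
// Only URLs prefixed with "mesos://" are treated as Mesos masters.
def isMesosMasterUrl(url: String): Boolean =
  url.startsWith("mesos://")

isMesosMasterUrl("mesos://zk://host:2181") // true  - multi-master via ZooKeeper
isMesosMasterUrl("mesos://host:5050")      // true  - single master
isMesosMasterUrl("zk://host:2181")         // false - now rejected by SparkContext
```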
…here to support SBT pom reader only.

Author: Prashant Sharma <[email protected]>

Closes #10012 from ScrapCodes/minor-build-comment.
Top is implemented in terms of takeOrdered, which already maintains the
order, so top should, too.

Author: Wieland Hoffmann <[email protected]>

Closes #10013 from mineo/top-order.
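A quick spark-shell style illustration of the ordering guarantee described above (`sc` is an existing SparkContext):

```scala
val rdd = sc.parallelize(Seq(3, 7, 1, 9, 5))

rdd.takeOrdered(3) // Array(1, 3, 5) - ascending
rdd.top(3)         // Array(9, 7, 5) - descending; order is maintained
```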
…ing database supports transactions

Fixes [SPARK-11989](https://issues.apache.org/jira/browse/SPARK-11989)

Author: CK50 <[email protected]>
Author: Christian Kurz <[email protected]>

Closes #9973 from CK50/branch-1.6_non-transactional.

(cherry picked from commit a589736)
Signed-off-by: Reynold Xin <[email protected]>
…to master-SPARK-12010

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
	sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala
CK50 commented Nov 30, 2015

I have tried to rebase this PR onto the latest apache/spark. - Does this look as expected?

srowen commented Nov 30, 2015

No, this branch has a lot of unrelated commits now. Generally you keep your fork's master up to date with the main one, and then bring yours up to date, then switch to your branch and git rebase master.

You can hard-reset back over the bad commits or just start over.

srowen commented Dec 1, 2015

@CK50 do you mind closing this PR? Maybe easier than rebasing and untangling it.

CK50 commented Dec 1, 2015

Need to create a new pull request for this fix.
