Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

Hive confs in hive-site.xml will be loaded into `hadoopConf`, so we should use `hadoopConf` in `InsertIntoHiveTable` instead of `SessionState.conf`.
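
To make the change concrete, here is a minimal sketch of the pattern being adopted (the helper `readHiveWriteConfs` and the default values shown are illustrative, not the actual Spark diff):

    import org.apache.hadoop.conf.Configuration

    // Illustrative sketch only -- not the exact Spark code.
    // `hadoopConf` is assumed to be built via SessionState.newHadoopConf,
    // which layers hive-site.xml (and SQL confs) on top of the Hadoop conf.
    def readHiveWriteConfs(hadoopConf: Configuration): (Boolean, String) = {
      // Before this PR these lookups went through SessionState.conf and
      // missed values that were only set in hive-site.xml.
      val dynamicPartition =
        hadoopConf.get("hive.exec.dynamic.partition", "true").toBoolean
      val partitionMode =
        hadoopConf.get("hive.exec.dynamic.partition.mode", "strict")
      (dynamicPartition, partitionMode)
    }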

How was this patch tested?

N/A

@cloud-fan
Contributor Author

cc @yhuai

@SparkQA

SparkQA commented Aug 14, 2016

Test build #63746 has finished for PR 14634 at commit 64268f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Great! It resolves my original concern. Thanks!

@gatorsmile
Member

hive.exec.dynamic.partition also impacts our regular writing paths (the DataFrameWriter APIs). As I recall, the DataFrameWriter APIs always assume this conf is true, right?

If it is controlled by hadoopConf, users might hit strange errors when using our DataFrameWriter APIs.

@cloud-fan
Contributor Author

hive.exec.dynamic.partition also impacts our regular writing paths

I think it's a Hive-only conf? A normal data source relation should not read this conf.

@gatorsmile
Member

gatorsmile commented Aug 15, 2016

If users want to use the DataFrameWriter's insertInto API for a partitioned Hive table, they have to set hive.exec.dynamic.partition to true and hive.exec.dynamic.partition.mode to nonstrict. Otherwise, it does not work: users are unable to specify partition values in the insertInto APIs. See the code.

Let me use a test case to show the issue.

    withTempDir { tmpDir =>
      val basePath = tmpDir.getCanonicalPath
      val externalTab = "extTable_with_partitions"
      withTable(externalTab) {
        assert(tmpDir.listFiles.isEmpty)
        sql(
          s"""
             |CREATE EXTERNAL TABLE $externalTab (key INT, value STRING)
             |PARTITIONED BY (ds STRING, hr STRING)
             |STORED AS PARQUET
             |LOCATION '$basePath'
           """.stripMargin)

        // SQL interface: a fully static partition spec works even with
        // dynamic partitioning disabled, since no partition value needs
        // to be inferred from the data.
        withSQLConf("hive.exec.dynamic.partition" -> "false") {
          for (ds <- Seq("2008-04-08", "2008-04-09"); hr <- Seq("11", "12")) {
            sql(
              s"""
                 |INSERT OVERWRITE TABLE $externalTab
                 |PARTITION (ds='$ds', hr='$hr')
                 |SELECT 1, 'a'
               """.stripMargin)
          }
        }

        // DataFrameWriter interface: insertInto takes partition values
        // from the data columns, so it only works with dynamic
        // partitioning enabled and the mode set to nonstrict.
        withSQLConf("hive.exec.dynamic.partition" -> "true",
            "hive.exec.dynamic.partition.mode" -> "nonstrict") {
          Seq((1, "2", "2008-04-09", "12")).toDF("key", "value", "ds", "hr").write
            .insertInto(externalTab)
        }
      }
    }

@gatorsmile
Member

Sorry, let me rephrase the potential issue.

  • The insertInto API forces users to set hive.exec.dynamic.partition to true and hive.exec.dynamic.partition.mode to nonstrict. This might not be convenient. The default value of hive.exec.dynamic.partition.mode is strict.
  • If we always read the settings from hadoopConf, does that mean users are unable to control them per query? Is it possible for users to change the values in hadoopConf at runtime?

@cloud-fan
Contributor Author

  1. hadoopConf contains all the confs from the SQL conf; see SessionState.newHadoopConf (sketched below).
  2. Users can change hadoopConf at runtime.
  3. The default value comes from Hive; I'm OK with not following it.
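
For readers following along, here is a rough Scala sketch of the merge that SessionState.newHadoopConf performs, per point 1 (a simplification of an internal Spark method, not its actual implementation):

    import org.apache.hadoop.conf.Configuration

    // Conceptual sketch: SQL confs are copied on top of the base Hadoop
    // conf (which already includes hive-site.xml), so a runtime
    // `spark.conf.set(...)` overrides a hive-site.xml value per session.
    def newHadoopConfSketch(
        base: Configuration,
        sqlConfs: Map[String, String]): Configuration = {
      val merged = new Configuration(base)
      sqlConfs.foreach { case (k, v) => merged.set(k, v) }
      merged
    }

This is why point 2 holds: a runtime spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict") takes effect for subsequent inserts in the same session.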

@gatorsmile
Member

Ah, I see. Thank you! No more questions. :)

@yhuai
Contributor

yhuai commented Aug 15, 2016

Sorry. What's the necessity to make this change?

@gatorsmile
Member

Based on my understanding, after this PR we will respect the values of hive.exec.dynamic.partition, hive.exec.dynamic.partition.mode, and hive.exec.compress.output that are specified in hive-site.xml?

@JoshRosen
Contributor

@yhuai, @cloud-fan What's the status of this issue? Should this still be targeted for 2.0.1 (which it is currently in JIRA)?

@cloud-fan
Contributor Author

cc @yhuai. In InsertIntoHiveTable we already call newHadoopConf, so I think it's safer to get these Hive confs from hadoopConf instead of sqlConf, to respect the Hive confs in hive-site.xml. We can check other places one by one later.

@cloud-fan
Contributor Author

retest this please

@cloud-fan
Contributor Author

@JoshRosen this is a regression in 2.0 (it works in 1.6), so I think we should target it for 2.0.

@SparkQA

SparkQA commented Sep 20, 2016

Test build #65620 has finished for PR 14634 at commit 64268f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Sep 20, 2016

This change looks good. Let's add a regression test.
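
For context, a minimal sketch of what such a regression test could look like, in the style of the test case quoted earlier in this thread (the test actually merged may differ; the test name and table name here are illustrative):

    // Sketch only -- the merged regression test may differ. Uses the same
    // SQLTestUtils helpers (withTable, withSQLConf) as the earlier example.
    test("dynamic partition confs are read from hadoopConf (sketch)") {
      withTable("part_tab") {
        sql("CREATE TABLE part_tab (key INT) PARTITIONED BY (ds STRING)")
        withSQLConf(
            "hive.exec.dynamic.partition" -> "true",
            "hive.exec.dynamic.partition.mode" -> "nonstrict") {
          // newHadoopConf merges these SQL confs into hadoopConf, so the
          // insert below should succeed after this PR.
          Seq((1, "2016-08-14")).toDF("key", "ds").write.insertInto("part_tab")
        }
        assert(sql("SELECT key FROM part_tab WHERE ds = '2016-08-14'").count() == 1)
      }
    }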

@SparkQA

SparkQA commented Sep 20, 2016

Test build #65629 has finished for PR 14634 at commit 90fbe4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HiveQuerySuite extends HiveComparisonTest with SQLTestUtils with BeforeAndAfter

@yhuai
Contributor

yhuai commented Sep 20, 2016

LGTM. Merging to master and branch 2.0.

asfgit pushed a commit that referenced this pull request Sep 20, 2016
## What changes were proposed in this pull request?

Hive confs in hive-site.xml will be loaded into `hadoopConf`, so we should use `hadoopConf` in `InsertIntoHiveTable` instead of `SessionState.conf`.

## How was this patch tested?

N/A

Author: Wenchen Fan <[email protected]>

Closes #14634 from cloud-fan/bug.

(cherry picked from commit eb004c6)
Signed-off-by: Yin Huai <[email protected]>
asfgit closed this in eb004c6 on Sep 20, 2016
cloud-fan deleted the bug branch on December 14, 2016 at 12:33