Conversation

@vanzin (Contributor) commented Dec 7, 2016

The value of the "isSrcLocal" parameter passed to Hive's loadTable and
loadPartition methods needs to be set according to the user query (e.g.
"LOAD DATA LOCAL"), rather than guessed from context as the current code
tries to do.

For existing versions of Hive the current behavior is probably ok, but
some recent changes in the Hive code altered the semantics slightly,
causing code that incorrectly sets "isSrcLocal" to "true" to do the
wrong thing: it ends up moving the parent directory of the files into
the final location instead of the files themselves, resulting in a
table that cannot be read.

I modified HiveCommandSuite so that existing "LOAD DATA" tests are run
both in local and non-local mode, since the semantics are slightly different.
The tests include a few new checks to make sure the semantics follow
what Hive describes in its documentation.

Tested with existing unit tests and also ran some Hive integration tests
with a version of Hive containing the changes that surfaced the problem.
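
As a rough illustration of the intended flow, here is a minimal sketch in which the flag is taken straight from the parsed query and handed to the load call. The names `LoadDataSketch` and `CatalogSketch` are illustrative stand-ins, not the actual Spark classes touched by this change:

```scala
// Sketch only: the flag is derived from the query, never guessed from paths.
trait CatalogSketch {
  def loadTable(table: String, path: String, overwrite: Boolean, isSrcLocal: Boolean): Unit
}

case class LoadDataSketch(
    table: String,
    path: String,
    isLocal: Boolean,      // true iff the query was "LOAD DATA LOCAL ..."
    isOverwrite: Boolean) {

  def run(catalog: CatalogSketch): Unit = {
    // Propagate the user's intent directly; no filesystem-based guessing.
    catalog.loadTable(table, path, overwrite = isOverwrite, isSrcLocal = isLocal)
  }
}
```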

@SparkQA commented Dec 7, 2016

Test build #69757 has finished for PR 16179 at commit f1e09f4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor Author) commented Dec 7, 2016

Hmm, those tests passed locally... let me rebase.

@vanzin (Contributor Author) commented Dec 7, 2016

I think I know what the problem is, this will require some test changes.

Need to make a copy of the input when using "LOAD DATA" vs.
"LOAD DATA LOCAL" since Hive moves the input file in the
former case.
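
Roughly, the test fix amounts to running non-local "LOAD DATA" statements against a throwaway copy of the input file. A hedged sketch of such a helper (the name and implementation are illustrative, not the PR's exact code):

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Run `body` against a disposable copy of `src`, so a non-local "LOAD DATA"
// (which moves its input into the table location) does not consume the
// shared test resource file.
def withCopyOf(src: File)(body: File => Unit): Unit = {
  val copy = File.createTempFile("load-data-input-", src.getName)
  try {
    Files.copy(src.toPath, copy.toPath, StandardCopyOption.REPLACE_EXISTING)
    body(copy)
  } finally {
    copy.delete() // no-op if Hive already moved the file away
  }
}
```
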
@SparkQA commented Dec 7, 2016

Test build #69805 has finished for PR 16179 at commit 93e07db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor Author) commented Dec 7, 2016

retest this please

@SparkQA commented Dec 7, 2016

Test build #69817 has finished for PR 16179 at commit 93e07db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor Author) commented Dec 7, 2016

Not sure who exactly to ping here, but let's try @cloud-fan and @yhuai

TimeUnit.MILLISECONDS).asInstanceOf[Long]
}

protected def isSrcLocal(path: Path, conf: HiveConf): Boolean = {
Contributor:

when will this be wrong?

Contributor Author:

For example, if your warehouse directory is in the local file system too (which happens during unit tests).
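
To make the failure mode concrete: a filesystem-based guess of roughly this shape (illustrative, not the removed code verbatim) reports "local" for any source path on the local filesystem, regardless of whether the query actually said LOCAL, so with a local-FS warehouse, as in unit tests, a plain "LOAD DATA" gets the wrong flag:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative guess: "the source is local iff it lives on the local filesystem".
// This cannot distinguish "LOAD DATA" from "LOAD DATA LOCAL" when the warehouse
// itself is on the local filesystem, which is exactly the unit-test setup.
def guessIsSrcLocal(path: Path, conf: Configuration): Boolean = {
  val srcFs = path.getFileSystem(conf)
  val localFs = FileSystem.getLocal(conf)
  srcFs.getUri == localFs.getUri
}
```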

Member:

Could you show us which Hive JIRA made the change?

Contributor Author:

I don't know the exact Hive change, since I didn't actually do git bisect or anything to try to find it, but the closest one I found was HIVE-12988.

But that's kinda beside the point: the underlying issue is that the Spark code wasn't really behaving correctly w.r.t. the semantics of the "isSrcLocal" value. The value of that flag is defined by the user query, and cannot be inferred from any other context. We were just lucky that things were working before.

overwrite,
holdDDLTime)
holdDDLTime,
isSrcLocal = false)
Member:

Then, how can we know this is never a local file system (e.g., as you said above, when the warehouse directory is in the local file system too)?

Contributor Author:

We don't need to. "isSrcLocal" comes from the user query.

"LOAD DATA LOCAL" -> "isSrcLocal" = true
anything else -> "isSrcLocal" = false

Member:

I see the reason why we can set it to false. The files are created by us. We can set it to false and let Hive move it instead of copying it.
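
To spell that mapping out, here is a purely illustrative snippet (the real code derives the flag from the parsed plan, not from raw query text): the flag follows the query shape, and it determines whether Hive copies the source (local, source kept) or moves it (non-local, source consumed).

```scala
// Illustrative only: "LOAD DATA LOCAL" -> true (copy), anything else -> false (move).
def isSrcLocalFor(query: String): Boolean =
  query.trim.toUpperCase.startsWith("LOAD DATA LOCAL")

assert(isSrcLocalFor("LOAD DATA LOCAL INPATH '/tmp/data.csv' INTO TABLE t"))   // copy, source kept
assert(!isSrcLocalFor("LOAD DATA INPATH 'hdfs:///data/data.csv' INTO TABLE t")) // move, source consumed
```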

@cloud-fan (Contributor) commented Dec 9, 2016

a kind of unrelated question: what if users use LOAD DATA LOCAL but give a non-local path like hdfs://..., or what if users use LOAD DATA but give a local path like file://...?

@vanzin (Contributor Author) commented Dec 9, 2016

> what if users use LOAD DATA LOCAL but give a non-local path

To be fair I don't even know what Hive does in those cases. Even its documentation is a little self-contradictory (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML). It says the path can be a URI, but if "LOCAL" is defined, it's always a local path. I'd have to try it on Hive but right now I'm kinda swamped with other things.

@vanzin (Contributor Author) commented Dec 9, 2016

To answer the first question:

> load data local inpath 'hdfs:/user/systest/data.csv' into table test;
Error: Error while compiling statement: FAILED: SemanticException [Error 10028]: Line 1:23 Path is not legal ''hdfs:/user/systest/data.csv'': Source file system should be "file" if "local" is specified (state=42000,code=10028)

@vanzin (Contributor Author) commented Dec 9, 2016

For the second question ("load data" with "file" URI) it seems to move the file from the local file system to the warehouse (as the Hive doc I linked above sort of suggests).

@gatorsmile (Member) commented Dec 9, 2016

https://issues.apache.org/jira/browse/HIVE-6024

Based on the above JIRA, Hive made an internal change in Hive 0.14. Do we need to add related test cases to VersionSuite, like what we did for CTAS in #16104?

@vanzin (Contributor Author) commented Dec 9, 2016

@gatorsmile that is not the change that surfaced the problem. Again, this change is not to work around a bug in Hive. This change is because Spark is doing things incorrectly, and we were just lucky to not hit this problem before. Hive is correct. Spark is not. Thus the change.

The changes that actually surface the problem are in Hive 2.1, which Spark does not yet support (officially at least?) as a metastore client.

Internally we have patches from Hive 2.1 in our Hive, so we started seeing this problem. Because Spark is behaving incorrectly, it's better to fix it to avoid future issues.

VersionSuite is not meant to capture these things; it's meant to make sure the HiveShim reflection code is doing the right thing for the various versions of Hive. In fact, it's not the test that failed for us (InsertIntoHiveTableSuite did).

@gatorsmile (Member) commented Dec 9, 2016

I understand that the existing way is not correct and that we should use the LOCAL keyword of the LOAD DATA command to populate the value of isSrcLocal. However, we also introduce behavior changes in InsertIntoHiveTable.

After the change, we always set isSrcLocal to false for InsertIntoHiveTable, so the temporary data files in its staging directory will be moved to the table location instead of being copied there. Is that right?

VersionSuite is also being used for testing end-to-end behaviors in #16104. In the future, we need to add more test cases to ensure support for all the Hive versions. I think this is the right direction to continue.

BTW, I am also trying to see whether our test case coverage for LOAD DATA is complete.

@vanzin (Contributor Author) commented Dec 10, 2016

> After the change, we always set isSrcLocal to false for InsertIntoHiveTable, so the temporary data files in its staging directory will be moved to the table location instead of being copied there. Is that right?

It depends. Without this change, it would depend on where the table was. If the table was in HDFS (or anything but the local FS), the files would be moved, so the behavior doesn't change. If the table was in the local filesystem, before this change the files would be copied, and later deleted when the staging directory was deleted. So in the end, it's the same thing.

With the change, the data would be moved in both cases, which is also correct and leads to the same result.

I just want to reinforce, again, that this is not about a change in behavior in Hive at all. This is Spark using a Hive API incorrectly.

> VersionSuite is also being used for testing end-to-end behaviors in #16104.

I'm not sure that's such a great idea, but in any case, the tests for this change are the existing tests in "InsertIntoHiveTableSuite" and "HiveCommandSuite". So basically you'd be asking to run those against all the different versions of Hive metastores supported by Spark. It's doable, but that's a bigger change that I don't really think is necessary here. The Hive semantics haven't changed. Spark was depending on undocumented behavior that worked out of luck, and this change fixes that.


/**
* Run a function with a copy of the input file. Use this for tests that use "LOAD DATA"
* (instead of "LOAD DATA LOCAL") since, according to Hive's semantics, files are moved
@cloud-fan (Contributor) Dec 10, 2016:

The semantic change happened in Hive 2.1; it looks like we don't need to update the tests for now?

Contributor Author:

Ah, the tests need to be updated because now loadTable is being called with "isSrcLocal = false". That makes the source file be moved instead of copied, and that makes subsequent unit tests fail. (That's the cause of the initial test failures in this PR.)

Contributor:

then can we test LOAD DATA and LOAD DATA LOCAL separately? We can add comments to explain the semantic difference between them and why we need to copy the file

Contributor Author:

Sure, I can move each of them into separate tests.

@cloud-fan (Contributor):
The changes LGTM, as we do propagate isSrcLocal incorrectly. It would be better if we can also fix the inconsistent behavior of LOAD DATA between Spark and Hive, and improve the test coverage, in a follow-up.

@gatorsmile (Member):
This PR is fixing an issue exposed in Hive 2.1. I am not very clear on why Hive 2.1 made such a change. If we knew the background of this change, it might be easier for us to judge whether this is the only issue after the change. At least, I think this PR does not make it worse. I am OK with this PR.

@vanzin (Contributor Author) commented Dec 10, 2016

> This PR is fixing an issue exposed in Hive 2.1.

No. This PR is fixing a misuse of an internal Hive API. It's important to understand that there's nothing wrong in Hive here; it's Spark that is using Hive's internal API incorrectly, and it's safer for Spark to not do that.

@vanzin (Contributor Author) commented Dec 10, 2016

If it makes it easier for you to understand why this has nothing to do with Hive 2.1, look at the unit test changes. They show that Spark was behaving differently from Hive in that particular situation ("LOAD DATA" with a warehouse in the local file system - Hive would move the source file, Spark would copy it and leave it around).

@SparkQA commented Dec 10, 2016

Test build #69973 has finished for PR 16179 at commit a8482a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 11, 2016

Test build #69976 has finished for PR 16179 at commit b451a70.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sql(s"""$loadQuery INPATH "$path" INTO TABLE part_table""")
}

intercept[AnalysisException] {
Member:

The error message is wrong.
LOAD DATA target table default.part_table is partitioned, but number of columns in provided partition spec (1) do not match number of partitioned columns in table (s2);

s2 is incorrect. We need to remove s from the following code:

s"(s${targetTable.partitionColumnNames.size})"

Contributor:

good catch! But I'd say it's not related to this PR, and I won't block merging this PR if this is the only issue.

Contributor:

BTW, it's in LoadDataCommand, line 206

sql(s"""$loadQuery INPATH "$path" INTO TABLE part_table PARTITION(c="1")""")
}
intercept[AnalysisException] {
sql(s"""$loadQuery INPATH "$path" INTO TABLE part_table PARTITION(d="1")""")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This negative case is identical to the one above. Are you expecting a different error message here?

Contributor Author:

I don't know. I just moved this code. It was already there before. (Note the partition definition is different, not that I know whether that matters for anything.)

@gatorsmile (Member) Dec 11, 2016:

uh, I see.

// employee.dat has two columns separated by '|', the first is an int, the second is a string.
// Its content looks like:
// 16|john
// 17|robert
Contributor:

also move these comments?

@cloud-fan (Contributor):
LGTM except 2 minor comments: #16179 (comment) and #16179 (comment)

@SparkQA commented Dec 12, 2016

Test build #70030 has finished for PR 16179 at commit 4e37c80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):
LGTM

@gatorsmile (Member):
Merging to master. Thanks!

@gatorsmile (Member):
Should we merge it to Spark 2.1? cc @vanzin @rxin @cloud-fan

@asfgit closed this in 476b34c Dec 12, 2016

@vanzin (Contributor Author) commented Dec 12, 2016

Maybe after 2.1.0 goes out? It's not really a critical fix.

isOverwrite: Boolean,
holdDDLTime: Boolean): Unit
holdDDLTime: Boolean,
isSrcLocal: Boolean): Unit
Contributor:

what does isSrcLocal mean? Can you document it?

Contributor Author:

It means the source data comes from a "LOAD DATA LOCAL" query.

I can add a partial scaladoc to these methods, but I don't really know the meaning of some of the other arguments, so I can't write a complete one.
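
For what it's worth, a hedged sketch of what that partial doc could look like on the shim method; the argument list is abbreviated and only the new parameter is documented, per the comment above:

```scala
import org.apache.hadoop.fs.Path

// Sketch only: abbreviated signature; the real shim method takes more arguments.
trait HiveShimSketch {
  /**
   * @param isSrcLocal whether the source data comes from a "LOAD DATA LOCAL"
   *                   query, i.e. whether Hive should copy the files from the
   *                   local filesystem into place rather than move them within
   *                   the destination filesystem.
   */
  def loadTable(
      loadPath: Path,
      tableName: String,
      replace: Boolean,
      holdDDLTime: Boolean,
      isSrcLocal: Boolean): Unit
}
```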

@vanzin deleted the SPARK-18752 branch December 12, 2016 22:53
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016

Author: Marcelo Vanzin <[email protected]>

Closes apache#16179 from vanzin/SPARK-18752.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017