[SPARK-18752][hive] "isSrcLocal" value should be set from user query. #16179
Conversation
The value of the "isSrcLocal" parameter passed to Hive's loadTable and loadPartition methods needs to be set according to the user query (e.g. "LOAD DATA LOCAL"), and not by the current code, which tries to guess what it should be. For existing versions of Hive the current behavior is probably ok, but some recent changes in the Hive code changed the semantics slightly, so code that incorrectly sets "isSrcLocal" to "true" now does the wrong thing: it ends up moving the parent directory of the files into the final location, instead of the files themselves, resulting in a table that cannot be read.
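As a rough sketch of the intended wiring (names simplified and hypothetical, not the exact Spark code), the flag should simply mirror the LOCAL keyword of the user's query:

```scala
// Hypothetical, simplified sketch: derive isSrcLocal from the parsed query
// instead of guessing it from the path's file system.
trait HiveCatalogLike {
  def loadTable(table: String, path: String, isOverwrite: Boolean, isSrcLocal: Boolean): Unit
}

case class LoadData(table: String, path: String, isLocal: Boolean, overwrite: Boolean)

def run(cmd: LoadData, catalog: HiveCatalogLike): Unit = {
  catalog.loadTable(
    cmd.table,
    cmd.path,
    isOverwrite = cmd.overwrite,
    // "LOAD DATA LOCAL INPATH ..." => true; anything else => false.
    isSrcLocal = cmd.isLocal)
}
```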
Test build #69757 has finished for PR 16179 at commit

Hmm, those tests passed locally... let me rebase.
I think I know what the problem is; this will require some test changes.
Need to make a copy of the input when using "LOAD DATA" vs. "LOAD DATA LOCAL" since Hive moves the input file in the former case.
Test build #69805 has finished for PR 16179 at commit

retest this please

Test build #69817 has finished for PR 16179 at commit
Not sure who exactly to ping here, but let's try @cloud-fan and @yhuai
        TimeUnit.MILLISECONDS).asInstanceOf[Long]
    }

    protected def isSrcLocal(path: Path, conf: HiveConf): Boolean = {
when will this be wrong?
For example, if your warehouse directory is in the local file system too (which happens during unit tests).
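Presumably the guess being removed boiled down to checking whether the path being loaded sits on the local file system, something like the following (an illustrative reconstruction, not the exact code). With a local-file-system warehouse, e.g. during unit tests, that check returns true even though the user never wrote "LOAD DATA LOCAL":

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hive.conf.HiveConf

// Illustrative reconstruction only: treat the source as "local" whenever its
// file system is the local one. A local-file-system warehouse (e.g. in unit
// tests) makes this guess diverge from what the user's query actually said.
def guessIsSrcLocal(path: Path, conf: HiveConf): Boolean = {
  path.getFileSystem(conf).getScheme == FileSystem.getLocal(conf).getScheme
}
```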
Could you show us which Hive JIRA made the change?
I don't know the exact Hive change, since I didn't actually do git bisect or anything to try to find it, but the closest one I found was HIVE-12988.
But that's kinda beside the point: the underlying issue is that the Spark code wasn't really behaving correctly w.r.t. the semantics of the "isSrcLocal" value. The value of that flag is defined by the user query, and cannot be inferred from any other context. We were just lucky that things were working before.
      overwrite,
-     holdDDLTime)
+     holdDDLTime,
+     isSrcLocal = false)
Then, how can we know this is always not a local file system (e.g., as you said above, if your warehouse directory is in the local file system too)?
We don't need to. "isSrcLocal" comes from the user query.
"LOAD DATA LOCAL" -> "isSrcLocal" = true
anything else -> "isSrcLocal" = false
I see the reason why we can set it to false. The files are created by us. We can set it to false and let Hive move it instead of copying it.
a kind of unrelated question: what if users use

To be fair I don't even know what Hive does in those cases. Even its documentation is a little self-contradictory (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML). It says the path can be a URI, but if "LOCAL" is defined, it's always a local path. I'd have to try it on Hive but right now I'm kinda swamped with other things.

To answer the first question:

For the second question ("load data" with "file" URI) it seems to move the file from the local file system to the warehouse (as the Hive doc I linked above sort of suggests).
https://issues.apache.org/jira/browse/HIVE-6024 Based on the above JIRA, Hive has an internal change in Hive 0.14. Do we need to add the related test cases to
@gatorsmile that is not the change that surfaced the problem. Again, this change is not to work around a bug in Hive. This change is because Spark is doing things incorrectly, and we were just lucky to not hit this problem before. Hive is correct. Spark is not. Thus the change. The changes that actually surface the problem are in Hive 2.1, which Spark does not yet support (officially at least?) as a metastore client. Internally we have patches from Hive 2.1 in our Hive, so we started seeing this problem. Because Spark is behaving incorrectly, it's better to fix it to avoid future issues.
I can understand the existing way is not correct and we should use After the changes, we always set
BTW, I am also trying to see whether the test case coverage of our
It depends. Without this change, it would depend on where the table was. If the table was in HDFS (or anything but the local FS), the files would be moved, so the behavior doesn't change. If the table was in the local filesystem, before this change the files would be copied, and later deleted when the staging directory was deleted. So in the end, it's the same thing. With the change, the data would be moved in both cases, which is also correct and leads to the same result. I just want to reinforce, again, that this is not about a change in behavior in Hive at all. This is Spark using a Hive API incorrectly.
I'm not sure that's such a great idea, but in any case, the tests for this change are the existing tests in "InsertIntoHiveTableSuite" and "HiveCommandSuite". So basically you'd be asking to run those against all the different versions of Hive metastores supported by Spark. It's doable, but that's a bigger change that I don't really think is necessary here. The Hive semantics haven't changed. Spark was depending on undocumented behavior that worked out of luck, and this change fixes that.
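For reference, here is a rough sketch of what the copy-vs-move distinction amounts to at the file-system level (illustrative only, using Hadoop's FileSystem API, not Hive's actual load path):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: what happens to the source file under each setting.
def loadFile(src: Path, dst: Path, isSrcLocal: Boolean, conf: Configuration): Unit = {
  val dstFs: FileSystem = dst.getFileSystem(conf)
  if (isSrcLocal) {
    // "LOAD DATA LOCAL": the local source is copied into the warehouse and the
    // original file is left in place.
    dstFs.copyFromLocalFile(src, dst)
  } else {
    // "LOAD DATA": the source is moved, so it no longer exists at its original
    // location once the load finishes.
    if (!dstFs.rename(src, dst)) {
      throw new java.io.IOException(s"Failed to move $src to $dst")
    }
  }
}
```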
    /**
     * Run a function with a copy of the input file. Use this for tests that use "LOAD DATA"
     * (instead of "LOAD DATA LOCAL") since, according to Hive's semantics, files are moved
The semantic change happened in Hive 2.1, so it looks like we don't need to update the tests for now?
Ah, the tests need to be updated because now loadTable is being called with "isSrcLocal = false". That makes the source file be moved instead of copied, and that makes subsequent unit tests fail. (That's the cause of the initial test failures in this PR.)
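A minimal sketch of the kind of test helper this implies (hypothetical name, plain JDK file utilities): run the test body against a throwaway copy, so the original input survives the move.

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Hypothetical test helper: a non-local "LOAD DATA" consumes its source file,
// so hand each test a disposable copy of the input instead of the original.
def withCopiedFile[T](original: File)(body: File => T): T = {
  val copy = File.createTempFile(original.getName, ".copy")
  Files.copy(original.toPath, copy.toPath, StandardCopyOption.REPLACE_EXISTING)
  try {
    body(copy)
  } finally {
    // The copy may already have been moved away by the load; delete it if not.
    copy.delete()
  }
}
```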
Then can we test LOAD DATA and LOAD DATA LOCAL separately? We can add comments to explain the semantic difference between them and why we need to copy the file.
Sure, I can move each of them into separate tests.
The changes LGTM, as we do propagate the
This PR is fixing an issue exposed in Hive 2.1. I am not very clear why Hive 2.1 made such a change. If we can know the background of this change, it might be easier for us to judge whether this is the only issue after the change. At least, I am thinking this PR does not make it worse. I am OK about this PR.
No. This PR is fixing a misuse of an internal Hive API. It's important to understand that there's nothing wrong in Hive here; it's Spark that is using Hive's internal API incorrectly, and it's safer for Spark to not do that.
If it makes it easier for you to understand why this has nothing to do with Hive 2.1, look at the unit test changes. They show that Spark was behaving differently from Hive in that particular situation ("LOAD DATA" with a warehouse in the local file system - Hive would move the source file, Spark would copy it and leave it around).
To make sure both work as expected.
Test build #69973 has finished for PR 16179 at commit

Test build #69976 has finished for PR 16179 at commit
        sql(s"""$loadQuery INPATH "$path" INTO TABLE part_table""")
    }

    intercept[AnalysisException] {
The error message is wrong:

> LOAD DATA target table default.part_table is partitioned, but number of columns in provided partition spec (1) do not match number of partitioned columns in table (s2);

`s2` is incorrect. We need to remove the `s` from the following code:

    s"(s${targetTable.partitionColumnNames.size})"
good catch! But I'd say it's not related to this PR, and I won't block merging this PR if this is the only issue.
BTW, it's in LoadDataCommand, line 206
        sql(s"""$loadQuery INPATH "$path" INTO TABLE part_table PARTITION(c="1")""")
    }
    intercept[AnalysisException] {
        sql(s"""$loadQuery INPATH "$path" INTO TABLE part_table PARTITION(d="1")""")
This negative case looks identical to the one above. Are you expecting a different error message here?
I don't know. I just moved this code. It was already there before. (Note the partition definition is different, not that I know whether that matters for anything.)
uh, I see.
    // employee.dat has two columns separated by '|', the first is an int, the second is a string.
    // Its content looks like:
    // 16|john
    // 17|robert
also move these comments?
LGTM except 2 minor comments: #16179 (comment) and #16179 (comment)

Test build #70030 has finished for PR 16179 at commit

LGTM

Merging to master. Thanks!

Should we merge it to Spark 2.1? cc @vanzin @rxin @cloud-fan

Maybe after 2.1.0 goes out? It's not really a critical fix.
      isOverwrite: Boolean,
-     holdDDLTime: Boolean): Unit
+     holdDDLTime: Boolean,
+     isSrcLocal: Boolean): Unit
what does isSrcLocal mean? Can you document it?
It means the source data comes from a "LOAD DATA LOCAL" query.
I can add a partial scaladoc to these methods, but I don't really know the meaning of some of the other arguments, so I can't write a complete one.
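Something along these lines, for example (a partial sketch; the parameter list is abbreviated and the wording is only a suggestion):

```scala
trait LoadSupport {
  /**
   * Loads data into an existing table.
   *
   * @param isOverwrite whether existing data in the table should be replaced.
   * @param holdDDLTime passed through to Hive (exact meaning not documented here).
   * @param isSrcLocal whether the source data lives on the local file system of the
   *                   submitting client, i.e. the load came from a "LOAD DATA LOCAL"
   *                   query; local sources are copied, non-local ones are moved.
   */
  def loadTable(
      loadPath: String,
      tableName: String,
      isOverwrite: Boolean,
      holdDDLTime: Boolean,
      isSrcLocal: Boolean): Unit
}
```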
The value of the "isSrcLocal" parameter passed to Hive's loadTable and
loadPartition methods needs to be set according to the user query (e.g.
"LOAD DATA LOCAL"), and not by the current code, which tries to guess what
it should be.

For existing versions of Hive the current behavior is probably ok, but
some recent changes in the Hive code changed the semantics slightly,
so code that incorrectly sets "isSrcLocal" to "true" now does the wrong
thing: it ends up moving the parent directory of the files into the final
location, instead of the files themselves, resulting in a table that
cannot be read.

I modified HiveCommandSuite so that existing "LOAD DATA" tests are run
both in local and non-local mode, since the semantics are slightly different.
The tests include a few new checks to make sure the semantics follow
what Hive describes in its documentation.

Tested with existing unit tests and also ran some Hive integration tests
with a version of Hive containing the changes that surfaced the problem.

Author: Marcelo Vanzin <[email protected]>

Closes apache#16179 from vanzin/SPARK-18752.