Conversation

@vanzin (Contributor) commented Dec 7, 2016

The value of the "isSrcLocal" parameter passed to Hive's loadTable and
loadPartition methods needs to be set according to the user query (e.g.
"LOAD DATA LOCAL"), rather than guessed from context as the current code
tries to do.

For existing versions of Hive the current behavior is probably ok, but
some recent changes in the Hive code altered the semantics slightly,
causing code that incorrectly sets "isSrcLocal" to "true" to do the
wrong thing: it ends up moving the parent directory of the files into
the final location instead of the files themselves, resulting in a
table that cannot be read.

I modified HiveCommandSuite so that existing "LOAD DATA" tests are run
both in local and non-local mode, since the semantics are slightly different.
The tests include a few new checks to make sure the semantics follow
what Hive describes in its documentation.

Tested with existing unit tests and also ran some Hive integration tests
with a version of Hive containing the changes that surfaced the problem.
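
As a rough illustration of the intended flow, here is a minimal sketch in which the flag is taken straight from the parsed query and handed to the load call. The names `LoadDataSketch` and `CatalogSketch` are illustrative stand-ins, not the actual Spark classes touched by this change:

```scala
// Sketch only: the flag is derived from the query, never guessed from paths.
trait CatalogSketch {
  def loadTable(table: String, path: String, overwrite: Boolean, isSrcLocal: Boolean): Unit
}

case class LoadDataSketch(
    table: String,
    path: String,
    isLocal: Boolean,      // true iff the query was "LOAD DATA LOCAL ..."
    isOverwrite: Boolean) {

  def run(catalog: CatalogSketch): Unit = {
    // Propagate the user's intent directly; no filesystem-based guessing.
    catalog.loadTable(table, path, overwrite = isOverwrite, isSrcLocal = isLocal)
  }
}
```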

@SparkQA commented Dec 7, 2016

Test build #69757 has finished for PR 16179 at commit f1e09f4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor Author) commented Dec 7, 2016

Hmm, those tests passed locally... let me rebase.

@vanzin (Contributor Author) commented Dec 7, 2016

I think I know what the problem is, this will require some test changes.

Need to make a copy of the input when using "LOAD DATA" vs.
"LOAD DATA LOCAL" since Hive moves the input file in the
former case.
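
Roughly, the test fix amounts to running non-local "LOAD DATA" statements against a throwaway copy of the input file. A hedged sketch of such a helper (the name and implementation are illustrative, not the PR's exact code):

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Run `body` against a disposable copy of `src`, so a non-local "LOAD DATA"
// (which moves its input into the table location) does not consume the
// shared test resource file.
def withCopyOf(src: File)(body: File => Unit): Unit = {
  val copy = File.createTempFile("load-data-input-", src.getName)
  try {
    Files.copy(src.toPath, copy.toPath, StandardCopyOption.REPLACE_EXISTING)
    body(copy)
  } finally {
    copy.delete() // no-op if Hive already moved the file away
  }
}
```
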
@SparkQA commented Dec 7, 2016

Test build #69805 has finished for PR 16179 at commit 93e07db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor Author) commented Dec 7, 2016

retest this please

@SparkQA commented Dec 7, 2016

Test build #69817 has finished for PR 16179 at commit 93e07db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor Author) commented Dec 7, 2016

Not sure who exactly to ping here, but let's try @cloud-fan and @yhuai

TimeUnit.MILLISECONDS).asInstanceOf[Long]
}

protected def isSrcLocal(path: Path, conf: HiveConf): Boolean = {
Contributor:

when will this be wrong?

Contributor Author:

For example, if your warehouse directory is in the local file system too (which happens during unit tests).
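
To make the failure mode concrete: a filesystem-based guess of roughly this shape (illustrative, not the removed code verbatim) reports "local" for any source path on the local filesystem, regardless of whether the query actually said LOCAL, so with a local-FS warehouse, as in unit tests, a plain "LOAD DATA" gets the wrong flag:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative guess: "the source is local iff it lives on the local filesystem".
// This cannot distinguish "LOAD DATA" from "LOAD DATA LOCAL" when the warehouse
// itself is on the local filesystem, which is exactly the unit-test setup.
def guessIsSrcLocal(path: Path, conf: Configuration): Boolean = {
  val srcFs = path.getFileSystem(conf)
  val localFs = FileSystem.getLocal(conf)
  srcFs.getUri == localFs.getUri
}
```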

Member:

Could you show us which Hive JIRA made the change?

Contributor Author:

I don't know the exact Hive change, since I didn't actually do git bisect or anything to try to find it, but the closest one I found was HIVE-12988.

But that's kinda beside the point: the underlying issue is that the Spark code wasn't really behaving correctly w.r.t. the semantics of the "isSrcLocal" value. The value of that flag is defined by the user query, and cannot be inferred from any other context. We were just lucky that things were working before.

overwrite,
holdDDLTime)
holdDDLTime,
isSrcLocal = false)
Member:

Then, how can we know this is never a local file system (e.g., as you said above, when the warehouse directory is in the local file system too)?

Contributor Author:

We don't need to. "isSrcLocal" comes from the user query.

"LOAD DATA LOCAL" -> "isSrcLocal" = true
anything else -> "isSrcLocal" = false

Member:

I see the reason why we can set it to false. The files are created by us. We can set it to false and let Hive move it instead of copying it.
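
To spell that mapping out, here is a purely illustrative snippet (the real code derives the flag from the parsed plan, not from raw query text): the flag follows the query shape, and it determines whether Hive copies the source (local, source kept) or moves it (non-local, source consumed).

```scala
// Illustrative only: "LOAD DATA LOCAL" -> true (copy), anything else -> false (move).
def isSrcLocalFor(query: String): Boolean =
  query.trim.toUpperCase.startsWith("LOAD DATA LOCAL")

assert(isSrcLocalFor("LOAD DATA LOCAL INPATH '/tmp/data.csv' INTO TABLE t"))   // copy, source kept
assert(!isSrcLocalFor("LOAD DATA INPATH 'hdfs:///data/data.csv' INTO TABLE t")) // move, source consumed
```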

@cloud-fan (Contributor) commented Dec 9, 2016

a kind of unrelated question: what if users use LOAD DATA LOCAL but give a non-local path like hdfs://..., or what if users use LOAD DATA but give a local path like file://...?

@vanzin (Contributor Author) commented Dec 9, 2016

> what if users use LOAD DATA LOCAL but give a non-local path

To be fair I don't even know what Hive does in those cases. Even its documentation is a little self-contradictory (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML). It says the path can be a URI, but if "LOCAL" is defined, it's always a local path. I'd have to try it on Hive but right now I'm kinda swamped with other things.

@vanzin (Contributor Author) commented Dec 9, 2016

To answer the first question:

> load data local inpath 'hdfs:/user/systest/data.csv' into table test;
Error: Error while compiling statement: FAILED: SemanticException [Error 10028]: Line 1:23 Path is not legal ''hdfs:/user/systest/data.csv'': Source file system should be "file" if "local" is specified (state=42000,code=10028)

@vanzin (Contributor Author) commented Dec 9, 2016

For the second question ("load data" with "file" URI) it seems to move the file from the local file system to the warehouse (as the Hive doc I linked above sort of suggests).

@gatorsmile (Member) commented Dec 9, 2016

https://issues.apache.org/jira/browse/HIVE-6024

Based on the above JIRA, Hive made an internal change in Hive 0.14. Do we need to add related test cases to VersionSuite, like what we did for CTAS in #16104?

@vanzin (Contributor Author) commented Dec 9, 2016

@gatorsmile that is not the change that surfaced the problem. Again, this change is not to work around a bug in Hive. This change is because Spark is doing things incorrectly, and we were just lucky to not hit this problem before. Hive is correct. Spark is not. Thus the change.

The changes that actually surface the problem are in Hive 2.1, which Spark does not yet support (officially at least?) as a metastore client.

Internally we have patches from Hive 2.1 in our Hive, so we started seeing this problem. Because Spark is behaving incorrectly, it's better to fix it to avoid future issues.

VersionSuite is not meant to capture these things; it's meant to make sure the HiveShim reflection code is doing the right thing for the various versions of Hive. In fact, it's not the test that failed for us (InsertIntoHiveTableSuite did).

@gatorsmile (Member) commented Dec 9, 2016

I understand that the existing way is not correct and that we should use the LOCAL keyword of the LOAD DATA command to populate the value of isSrcLocal. However, we also introduce behavior changes in InsertIntoHiveTable.

After the change, we always set isSrcLocal to false for InsertIntoHiveTable, so the temporary data files in its staging directory will be moved to the table location instead of being copied there. Is that right?

VersionSuite is also being used for testing end-to-end behaviors in #16104. In the future, we need to add more test cases to ensure support for all the Hive versions. I think this is the right direction to continue.

BTW, I am also trying to see whether our test case coverage for LOAD DATA is complete.

@vanzin (Contributor Author) commented Dec 10, 2016

> After the change, we always set isSrcLocal to false for InsertIntoHiveTable, so the temporary data files in its staging directory will be moved to the table location instead of being copied there. Is that right?

It depends. Without this change, it would depend on where the table was. If the table was in HDFS (or anything but the local FS), the files would be moved, so the behavior doesn't change. If the table was in the local filesystem, before this change the files would be copied, and later deleted when the staging directory was deleted. So in the end, it's the same thing.

With the change, the data would be moved in both cases, which is also correct and leads to the same result.

I just want to reinforce, again, that this is not about a change in behavior in Hive at all. This is Spark using a Hive API incorrectly.

> VersionSuite is also being used for testing end-to-end behaviors in #16104.

I'm not sure that's such a great idea, but in any case, the tests for this change are the existing tests in "InsertIntoHiveTableSuite" and "HiveCommandSuite". So basically you'd be asking to run those against all the different versions of Hive metastores supported by Spark. It's doable, but that's a bigger change that I don't really think is necessary here. The Hive semantics haven't changed. Spark was depending on undocumented behavior that worked out of luck, and this change fixes that.


/**
* Run a function with a copy of the input file. Use this for tests that use "LOAD DATA"
* (instead of "LOAD DATA LOCAL") since, according to Hive's semantics, files are moved
@cloud-fan (Contributor) Dec 10, 2016:

The semantic change happened in Hive 2.1; it looks like we don't need to update the tests for now?

Contributor Author:

Ah, the tests need to be updated because now loadTable is being called with "isSrcLocal = false". That makes the source file be moved instead of copied, and that makes subsequent unit tests fail. (That's the cause of the initial test failures in this PR.)

Contributor:

then can we test LOAD DATA and LOAD DATA LOCAL separately? We can add comments to explain the semantic difference between them and why we need to copy the file

Contributor Author:

Sure, I can move each of them into separate tests.

@cloud-fan (Contributor):
The changes LGTM, as we do propagate isSrcLocal incorrectly. It would be better if we can also fix the inconsistent behavior of LOAD DATA between Spark and Hive, and improve the test coverage, in a follow-up.

@gatorsmile (Member):
This PR is fixing an issue exposed in Hive 2.1. I am not very clear on why Hive 2.1 made such a change. If we knew the background of this change, it might be easier for us to judge whether this is the only issue after the change. At least, I think this PR does not make it worse. I am OK with this PR.

@vanzin (Contributor Author) commented Dec 10, 2016

> This PR is fixing an issue exposed in Hive 2.1.

No. This PR is fixing a misuse of an internal Hive API. It's important to understand that there's nothing wrong in Hive here; it's Spark that is using Hive's internal API incorrectly, and it's safer for Spark to not do that.

@vanzin (Contributor Author) commented Dec 10, 2016

If it makes it easier for you to understand why this has nothing to do with Hive 2.1, look at the unit test changes. They show that Spark was behaving differently from Hive in that particular situation ("LOAD DATA" with a warehouse in the local file system - Hive would move the source file, Spark would copy it and leave it around).

@SparkQA commented Dec 10, 2016

Test build #69973 has finished for PR 16179 at commit a8482a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 11, 2016

Test build #69976 has finished for PR 16179 at commit b451a70.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sql(s"""$loadQuery INPATH "$path" INTO TABLE part_table""")
}

intercept[AnalysisException] {
Member:

The error message is wrong.
LOAD DATA target table default.part_table is partitioned, but number of columns in provided partition spec (1) do not match number of partitioned columns in table (s2);

s2 is incorrect. We need to remove s from the following code:

s"(s${targetTable.partitionColumnNames.size})"

Contributor:

good catch! But I'd say it's not related to this PR, and I won't block merging this PR if this is the only issue.

Contributor:

BTW, it's in LoadDataCommand, line 206

sql(s"""$loadQuery INPATH "$path" INTO TABLE part_table PARTITION(c="1")""")
}
intercept[AnalysisException] {
sql(s"""$loadQuery INPATH "$path" INTO TABLE part_table PARTITION(d="1")""")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This negative case is identical to the one above. Are you expecting a different error message here?

Contributor Author:

I don't know. I just moved this code. It was already there before. (Note the partition definition is different, not that I know whether that matters for anything.)

@gatorsmile (Member) Dec 11, 2016:

uh, I see.

// employee.dat has two columns separated by '|', the first is an int, the second is a string.
// Its content looks like:
// 16|john
// 17|robert
Contributor:

also move these comments?

@cloud-fan (Contributor):
LGTM except 2 minor comments: #16179 (comment) and #16179 (comment)

@SparkQA commented Dec 12, 2016

Test build #70030 has finished for PR 16179 at commit 4e37c80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):
LGTM

@gatorsmile (Member):
Merging to master. Thanks!

@gatorsmile (Member):
Should we merge it to Spark 2.1? cc @vanzin @rxin @cloud-fan

@asfgit closed this in 476b34c Dec 12, 2016

@vanzin (Contributor Author) commented Dec 12, 2016

Maybe after 2.1.0 goes out? It's not really a critical fix.

isOverwrite: Boolean,
holdDDLTime: Boolean): Unit
holdDDLTime: Boolean,
isSrcLocal: Boolean): Unit
Contributor:

what does isSrcLocal mean? Can you document it?

Contributor Author:

It means the source data comes from a "LOAD DATA LOCAL" query.

I can add a partial scaladoc to these methods, but I don't really know the meaning of some of the other arguments, so I can't write a complete one.
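
For what it's worth, a hedged sketch of what that partial doc could look like on the shim method; the argument list is abbreviated and only the new parameter is documented, per the comment above:

```scala
import org.apache.hadoop.fs.Path

// Sketch only: abbreviated signature; the real shim method takes more arguments.
trait HiveShimSketch {
  /**
   * @param isSrcLocal whether the source data comes from a "LOAD DATA LOCAL"
   *                   query, i.e. whether Hive should copy the files from the
   *                   local filesystem into place rather than move them within
   *                   the destination filesystem.
   */
  def loadTable(
      loadPath: Path,
      tableName: String,
      replace: Boolean,
      holdDDLTime: Boolean,
      isSrcLocal: Boolean): Unit
}
```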

@vanzin deleted the SPARK-18752 branch December 12, 2016 22:53
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016

Author: Marcelo Vanzin <[email protected]>

Closes apache#16179 from vanzin/SPARK-18752.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017