
Conversation

@zzccctv commented Jan 23, 2021

HiveHBaseTableInputFormat relies on two versions of InputFormat: org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapreduce.InputFormat. This causes both conditions to be true:

  1. classOf[oldInputClass[_, _]].isAssignableFrom(inputFormatClazz) is true
  2. classOf[newInputClass[_, _]].isAssignableFrom(inputFormatClazz) is true

In view of this situation, it is expected to be compatible with the old version first (see the sketch below).
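
For clarity, a minimal sketch of the ordering this description asks for; this is not the actual Spark code. The helper names createOldHadoopRDD/createNewHadoopRDD are the ones that appear in the review snippet further down, while chooseRDDPath and its return values are illustrative only:

    import org.apache.hadoop.mapred
    import org.apache.hadoop.mapreduce

    object InputFormatDispatchSketch {
      // When a format class is assignable to both APIs, checking the old mapred API
      // first makes the old-RDD path win, which is the ordering this PR argues for.
      def chooseRDDPath(inputFormatClazz: Class[_]): String = {
        if (classOf[mapred.InputFormat[_, _]].isAssignableFrom(inputFormatClazz)) {
          "createOldHadoopRDD"   // old mapred API preferred when both checks match
        } else if (classOf[mapreduce.InputFormat[_, _]].isAssignableFrom(inputFormatClazz)) {
          "createNewHadoopRDD"   // fall back to the new mapreduce API
        } else {
          throw new IllegalArgumentException(s"Unsupported input format: $inputFormatClazz")
        }
      }
    }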

github-actions bot added the SQL label Jan 23, 2021
@AmplabJenkins

Can one of the admins verify this patch?

  createNewHadoopRDD(localTableDesc, inputPathStr)
} else {
  if (classOf[oldInputClass[_, _]].isAssignableFrom(inputFormatClazz)) {
    createOldHadoopRDD(localTableDesc, inputPathStr)

@yikf (Contributor) commented Jan 23, 2021

Can you add a test to the suite for this change and explain the error?

@zzccctv (Author) replied

OK, let me see how to write it

@dongjoon-hyun (Member) left a comment

Hi, if it's assignable to both, why does Apache Spark need to use the old one? Instead, it sounds like HiveHBaseTableInputFormat is missing a correct implementation for mapreduce.InputFormat. This doesn't look like a Spark issue to me. Is there a Hive JIRA issue for that?

HiveHBaseTableInputFormat relies on two versions of InputFormat: org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapreduce.InputFormat. This causes both conditions to be true:
classOf[oldInputClass[_, _]].isAssignableFrom(inputFormatClazz) is true
classOf[newInputClass[_, _]].isAssignableFrom(inputFormatClazz) is true
In view of this situation, it is expected to be compatible with the old version first.

@HyukjinKwon (Member) commented

Is this a duplicate of #29178 and #31147? I have the same question (#29178 (comment)), and I agree with @dongjoon-hyun's point.

@yangBottle commented

Hi, if it's assignable to both, why does Apache Spark need to use the old one? Instead, it sounds like HiveHBaseTableInputFormat is missing a correct implementation for mapreduce.InputFormat. This doesn't look like a Spark issue to me. Is there a Hive JIRA issue for that?

HiveHBaseTableInputFormat relies on two versions of InputFormat: org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapreduce.InputFormat. This causes both conditions to be true:
classOf[oldInputClass[_, _]].isAssignableFrom(inputFormatClazz) is true
classOf[newInputClass[_, _]].isAssignableFrom(inputFormatClazz) is true
In view of this situation, it is expected to be compatible with the old version first.

I think that, in order to be compatible with implementation classes like 'org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat' that use both the old API and the new API, creating the old one should take priority.
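
As a small illustration of why both conditions hold at the same time for such a class: the toy DualApiInputFormat below is only a stand-in, not the real Hive class, but like HiveHBaseTableInputFormat it is declared against both the old and the new API, so both assignability checks succeed.

    import org.apache.hadoop.mapred
    import org.apache.hadoop.mapreduce

    // Toy stand-in (not the real Hive class): declared against both Hadoop APIs at once.
    abstract class DualApiInputFormat[K, V]
      extends mapreduce.InputFormat[K, V]
      with mapred.InputFormat[K, V]

    object DualApiCheck extends App {
      val clazz = classOf[DualApiInputFormat[_, _]]
      // Both lines print "true", which is exactly the ambiguity discussed in this thread.
      println(classOf[mapred.InputFormat[_, _]].isAssignableFrom(clazz))
      println(classOf[mapreduce.InputFormat[_, _]].isAssignableFrom(clazz))
    }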

@HyukjinKwon (Member) commented

@yangBottle do you have an official answer from the Apache Hadoop community or the HBase community that we should look up the old ones first?

@HyukjinKwon (Member) commented

And, you're basically saying that HiveHBaseTableInputFormat's mapreduce implementation is unusable, and that it is an issue in that code.

@yangBottle commented

@yangBottle do you have an official answer from the Apache Hadoop community or the HBase community that we should look up the old ones first?

And, you're basically saying that HiveHBaseTableInputFormat's mapreduce implementation is unusable, and that it is an issue in that code.

No, this is just my personal opinion. But when I debugged the source code, I found that some of HBase's initialization operations are done in the old API's interface implementation, so creating a NewHadoopRDD gets an empty HBase table instance. That is why I think it should look up the old one first.
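
A purely hypothetical sketch of the failure mode described above (toy names only, nothing taken from the real Hive or HBase code): setup happens only on the old-API entry point, so a reader that goes through the new-API entry point sees an unconfigured, effectively empty table.

    // Hypothetical toy, not HiveHBaseTableInputFormat: only the shape of the problem.
    class ToyDualApiFormat {
      private var table: Option[String] = None   // stands in for the HBase table handle

      // Old-API style entry point: the table handle is set up here.
      def getSplitsOldApi(tableName: String): Seq[String] = {
        table = Some(tableName)
        Seq(s"$tableName-split-0")
      }

      // New-API style entry point: assumes setup already happened, so it returns
      // nothing when the old-API path was never exercised.
      def getSplitsNewApi(): Seq[String] =
        table.map(t => Seq(s"$t-split-0")).getOrElse(Seq.empty)
    }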

@zzccctv (Author) commented Jan 25, 2021

@HyukjinKwon I still think that priority should be given to finding the old ones. This problem involves changes across Hadoop, HBase, and Hive, so the upgrade cost for a user's cluster environment is higher; by contrast, Spark is more lightweight.

@xza-m commented Sep 5, 2022

How should I apply this change?

@zzccctv (Author) commented Sep 15, 2022

How should I apply this change?

Change the code to match this PR and recompile it.
