-
Notifications
You must be signed in to change notification settings - Fork 90
GH-87: [Vector] Add ExtensionWriter #697
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Based on changes from apache/arrow#41731. Added writer ExtensionWriter with 3 methods: - write method for writing values from Extension holders; - writeExtensionType method for writing values (arguments is Object because we don't know exact type); - addExtensionTypeFactory method - because exact vector and value type are unknown, user should create their own extension type vector, writer for it and ExtensionTypeFactory where it should map vector and writer.
This comment has been minimized.
This comment has been minimized.
@lidavidm, please take a look if this approach can be used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we preserve the original committer's information if possible so they can get the credit for the original work? (A Co-authored-by
tag in the PR should suffice I think)
Is there some way we could have a type-safe design instead of just Object
everywhere?
|
||
import org.apache.arrow.vector.ExtensionTypeVector; | ||
|
||
public interface ExtensionTypeWriterFactory<T extends AbstractFieldWriter> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we document this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the type bound need to be AbstractFieldWriter
or can it just be FieldWriter
(the interface)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose AbstractFieldWriter just to make sure that the user will use some specific implementation of writer, but in general yes - it could be FieldWriter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO abstract implementation classes should not go in generic bounds - it should be the interface type
vector/src/main/java/org/apache/arrow/vector/holders/ExtensionHolder.java
Show resolved
Hide resolved
this.vector = vector; | ||
} | ||
|
||
@Override |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way we can provide default implementations for these functions to reduce the boilerplate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to just pull the UuidType into a toplevel class now that it's being used by multiple tests.
Because the type is unknown, we could allow only Holder impl for writing. Is this acceptable? |
|
||
public class TestUuidVector { | ||
|
||
public static class UuidType extends ExtensionType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant was, can we just make this a toplevel class? Not a nested class?
} | ||
} | ||
|
||
public static class UuidVector extends ExtensionTypeVector<FixedSizeBinaryVector> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
import org.apache.arrow.vector.types.pojo.ArrowType.ExtensionType; | ||
import org.apache.arrow.vector.util.TransferPair; | ||
|
||
public class TestUuidVector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact why is this class here at all? All it does is wrap the two inner classes
void writeNull(); | ||
<T extends ExtensionHolder> void write(T var1); | ||
void writeExtensionType(Object var1); | ||
<T extends ExtensionTypeWriterFactory> void addExtensionTypeFactory(T var1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the existing code is short on docstrings, can we add docstrings for new code going forward? In particular it's not clear what addExtensionTypeFactory
is or how to use it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the generic bounds actually useful here? Why can't it just be ExtensionTypeWriterFactory var1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we replace var1
with meaningful parameter names?
|
||
public interface ExtensionWriter extends BaseWriter { | ||
void writeNull(); | ||
<T extends ExtensionHolder> void write(T var1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here - why not just void write(ExtensionHolder value)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
import org.apache.arrow.vector.ExtensionTypeVector; | ||
|
||
public interface ExtensionTypeWriterFactory<T extends AbstractFieldWriter> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO abstract implementation classes should not go in generic bounds - it should be the interface type
vector/src/main/java/org/apache/arrow/vector/complex/impl/ExtensionTypeWriterFactory.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any test of using ExtensionHolder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, just some final naming questions
void writeNull(); | ||
|
||
/** | ||
* Writes vlaue from the given extension holder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Writes vlaue from the given extension holder. | |
* Writes value from the given extension holder. |
|
||
@Override | ||
public void write(ExtensionHolder holder) { | ||
if (holder instanceof UuidHolder) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either delete the check to be consistent with writeExtensionType or explicitly error if the holder is the wrong type
import org.apache.arrow.vector.holder.UuidHolder; | ||
import org.apache.arrow.vector.holders.ExtensionHolder; | ||
|
||
public class UuidWriterImpl extends AbstractExtensionTypeWriter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public class UuidWriterImpl extends AbstractExtensionTypeWriter { | |
public class UuidWriterImpl extends AbstractExtensionTypeWriter<UuidVector> { |
ByteBuffer bb = ByteBuffer.allocate(16); | ||
bb.putLong(uuid.getMostSignificantBits()); | ||
bb.putLong(uuid.getLeastSignificantBits()); | ||
((UuidVector) this.vector).setSafe(this.idx(), bb.array()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
((UuidVector) this.vector).setSafe(this.idx(), bb.array()); | |
vector.setSafe(idx(), bb.array()); |
((UuidVector) this.vector).setSafe(this.idx(), uuidHolder.value); | ||
this.vector.setValueCount(this.idx() + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
((UuidVector) this.vector).setSafe(this.idx(), uuidHolder.value); | |
this.vector.setValueCount(this.idx() + 1); | |
vector.setSafe(idx(), uuidHolder.value); | |
vector.setValueCount(idx() + 1); |
* | ||
* @param value the extension type value to write | ||
*/ | ||
void writeExtensionType(Object value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nit but now I'm thinking it should be just writeExtension
to parallel writeVarChar
* | ||
* @param factory the extension type factory to add | ||
*/ | ||
void addExtensionTypeFactory(ExtensionTypeWriterFactory factory); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void addExtensionTypeFactory(ExtensionTypeWriterFactory factory); | |
void addExtensionTypeWriterFactory(ExtensionTypeWriterFactory factory); |
for consistency as well
|
||
@Override | ||
protected int idx() { | ||
return super.idx(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, is the only reason we override here to make it protected
instead of package-private? (Should we just use getPosition()
instead as it is already public and does the same thing?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, makes sense - logic with idx() method added for all other writers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but please see the pre-commit failures
Based on changes from apache/arrow#41731. ## What's Changed Added writer ExtensionWriter with 3 methods: - write method for writing values from Extension holders; - writeExtensionType method for writing values (arguments is Object because we don't know exact type); - addExtensionTypeFactory method - because the exact vector and value type are unknown, the user should create their own extension type vector, write for it, and ExtensionTypeFactory, which should map the vector and writer. Closes apache#87. Co-authored-by: Finn Völkel <[email protected]>
apacheGH-87: [Vector] Add ExtensionWriter (apache#697)
* apacheGH-87: [Vector] Add ExtensionWriter (apache#697) missed file
* apacheGH-87: [Vector] Add ExtensionWriter (apache#697) missed file * apacheGH-87: [Vector] Add ExtensionWriter (apache#697) missed file * apacheGH-87: [Vector] Add ExtensionWriter (apache#697) updated UnionListWriter.
Based on changes from apache/arrow#41731.
What's Changed
Added writer ExtensionWriter with 3 methods:
Closes #87.
Co-authored-by: Finn Völkel [email protected]