Contributor

@xxlaykxx xxlaykxx commented Apr 1, 2025

Based on changes from apache/arrow#41731.

What's Changed

Added writer ExtensionWriter with 3 methods:

  • write method for writing values from Extension holders;
  • writeExtensionType method for writing values (the argument is an Object because the exact type is unknown);
  • addExtensionTypeFactory method: because the exact vector and value types are unknown, the user should create their own extension type vector, a writer for it, and an ExtensionTypeFactory that maps the vector to the writer.
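
A minimal sketch of how the three pieces described above might fit together. Every type below (DemoUuidVector, DemoUuidHolder, DemoExtensionWriter, DemoWriterFactory) is a hypothetical stand-in written for illustration, not an Arrow class, and the factory method name getWriter is an assumption:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical stand-ins for illustration only; none of these are Arrow classes.
class DemoUuidVector {
  final List<UUID> values = new ArrayList<>();
}

class DemoUuidHolder {
  UUID value;
}

interface DemoExtensionWriter {
  void write(DemoUuidHolder holder);      // value arrives wrapped in a holder
  void writeExtensionType(Object value);  // value arrives untyped
}

interface DemoWriterFactory {
  // Maps a user-defined vector to the writer that knows how to fill it.
  DemoExtensionWriter getWriter(Object vector);
}

class DemoUuidWriter implements DemoExtensionWriter {
  private final DemoUuidVector vector;

  DemoUuidWriter(DemoUuidVector vector) {
    this.vector = vector;
  }

  @Override
  public void write(DemoUuidHolder holder) {
    vector.values.add(holder.value);
  }

  @Override
  public void writeExtensionType(Object value) {
    // The cast is needed because the parameter is declared as Object.
    vector.values.add((UUID) value);
  }
}

class DemoUuidWriterFactory implements DemoWriterFactory {
  @Override
  public DemoExtensionWriter getWriter(Object vector) {
    if (vector instanceof DemoUuidVector) {
      return new DemoUuidWriter((DemoUuidVector) vector);
    }
    throw new IllegalArgumentException("Unsupported vector: " + vector);
  }
}
```

In this sketch the factory is the only place that knows which writer belongs to which vector, which is the role the description gives to ExtensionTypeFactory.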

Closes #87.

Co-authored-by: Finn Völkel [email protected]

Contributor Author

xxlaykxx commented Apr 1, 2025

@lidavidm, please take a look and let me know whether this approach can be used.

@lidavidm lidavidm added the enhancement label and removed the breaking-change label Apr 3, 2025
@github-actions github-actions bot added this to the 18.3.0 milestone Apr 3, 2025
Member

@lidavidm lidavidm left a comment

Can we preserve the original committer's information if possible so they can get the credit for the original work? (A Co-authored-by tag in the PR should suffice I think)

Is there some way we could have a type-safe design instead of just Object everywhere?


import org.apache.arrow.vector.ExtensionTypeVector;

public interface ExtensionTypeWriterFactory<T extends AbstractFieldWriter> {
Member

Can we document this?

Member

Does the type bound need to be AbstractFieldWriter or can it just be FieldWriter (the interface)?

Contributor Author

I chose AbstractFieldWriter just to make sure that the user uses a specific writer implementation, but in general, yes, it could be FieldWriter.

Member

IMO abstract implementation classes should not go in generic bounds - it should be the interface type
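
To illustrate the point about bounds, here is a small sketch using hypothetical stand-ins: BoundWriter plays the role of the interface and AbstractBoundWriter the role of the abstract helper class. Neither is an Arrow type.

```java
// Hypothetical stand-ins for an interface / abstract-helper pair.
interface BoundWriter {
  String describe();
}

abstract class AbstractBoundWriter implements BoundWriter {
  @Override
  public String describe() {
    return "built on the abstract helper";
  }
}

// A writer that reuses the abstract helper.
class HelperBoundWriter extends AbstractBoundWriter {}

// A writer that implements the interface directly.
class StandaloneBoundWriter implements BoundWriter {
  @Override
  public String describe() {
    return "standalone";
  }
}

// Bounding on the interface admits both kinds of writer.
interface BoundFactory<T extends BoundWriter> {
  T create();
}

class HelperBoundFactory implements BoundFactory<HelperBoundWriter> {
  @Override
  public HelperBoundWriter create() {
    return new HelperBoundWriter();
  }
}

// This declaration would not compile if the bound were AbstractBoundWriter.
class StandaloneBoundFactory implements BoundFactory<StandaloneBoundWriter> {
  @Override
  public StandaloneBoundWriter create() {
    return new StandaloneBoundWriter();
  }
}
```

With the interface as the bound, implementers stay free to skip the abstract helper; the helper remains a convenience rather than a requirement.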

this.vector = vector;
}

@Override
Member

Is there a way we can provide default implementations for these functions to reduce the boilerplate?

Member

We may want to just pull the UuidType into a toplevel class now that it's being used by multiple tests.

@lidavidm lidavidm changed the title from "GH-87: [Java][Vector] Add ExtensionWriter for List/Struct." to "GH-87: [Java][Vector] Add ExtensionWriter" Apr 3, 2025
@lidavidm lidavidm changed the title from "GH-87: [Java][Vector] Add ExtensionWriter" to "GH-87: [Vector] Add ExtensionWriter" Apr 3, 2025
@xxlaykxx xxlaykxx changed the title from "GH-87: [Vector] Add ExtensionWriter" to "GH-87 [Vector] Add ExtensionWriter" Apr 4, 2025
@xxlaykxx xxlaykxx changed the title from "GH-87 [Vector] Add ExtensionWriter" to "GH-87: [Vector] Add ExtensionWriter" Apr 4, 2025
Contributor Author

xxlaykxx commented Apr 4, 2025

> Can we preserve the original committer's information if possible so they can get the credit for the original work? (A Co-authored-by tag in the PR should suffice I think)
>
> Is there some way we could have a type-safe design instead of just Object everywhere?

Because the type is unknown, we could allow writing only through Holder implementations. Is this acceptable?


public class TestUuidVector {

public static class UuidType extends ExtensionType {
Member

What I meant was, can we just make this a toplevel class? Not a nested class?

}
}

public static class UuidVector extends ExtensionTypeVector<FixedSizeBinaryVector>
Member

Same here

import org.apache.arrow.vector.types.pojo.ArrowType.ExtensionType;
import org.apache.arrow.vector.util.TransferPair;

public class TestUuidVector {
Member

In fact why is this class here at all? All it does is wrap the two inner classes

void writeNull();
<T extends ExtensionHolder> void write(T var1);
void writeExtensionType(Object var1);
<T extends ExtensionTypeWriterFactory> void addExtensionTypeFactory(T var1);
Member

While the existing code is short on docstrings, can we add docstrings for new code going forward? In particular it's not clear what addExtensionTypeFactory is or how to use it

Member

Are the generic bounds actually useful here? Why can't it just be ExtensionTypeWriterFactory var1?
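
A short sketch of why the method-level bound adds nothing here. ParamFactory and UuidParamFactory are hypothetical stand-ins for the factory type, not Arrow classes. Because the type parameter appears only in the parameter position, both signatures accept exactly the same arguments:

```java
// Hypothetical stand-ins for a factory type and one subclass of it.
class ParamFactory {
  final String name;

  ParamFactory(String name) {
    this.name = name;
  }
}

class UuidParamFactory extends ParamFactory {
  UuidParamFactory() {
    super("uuid");
  }
}

class ParamDemo {
  // Generic form: T is used only once, so it carries no extra information.
  static <T extends ParamFactory> String registerGeneric(T factory) {
    return factory.name;
  }

  // Plain form: subtype polymorphism already accepts every subclass.
  static String registerPlain(ParamFactory factory) {
    return factory.name;
  }
}
```

A bound like this only starts to matter when T is also used in the return type or ties two parameters together.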

Member

Can we replace var1 with meaningful parameter names?


public interface ExtensionWriter extends BaseWriter {
void writeNull();
<T extends ExtensionHolder> void write(T var1);
Member

Same here - why not just void write(ExtensionHolder value)?

Contributor Author

fixed


import org.apache.arrow.vector.ExtensionTypeVector;

public interface ExtensionTypeWriterFactory<T extends AbstractFieldWriter> {
Member

IMO abstract implementation classes should not go in generic bounds - it should be the interface type

Member

I don't see any test of using ExtensionHolder?

Member

@lidavidm lidavidm left a comment

Looking good, just some final naming questions

void writeNull();

/**
* Writes vlaue from the given extension holder.
Member

Suggested change
- * Writes vlaue from the given extension holder.
+ * Writes value from the given extension holder.


@Override
public void write(ExtensionHolder holder) {
if (holder instanceof UuidHolder) {
Member

Either delete the check to be consistent with writeExtensionType or explicitly error if the holder is the wrong type
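
One way the "explicitly error" option could look, sketched with hypothetical stand-in types (CheckedHolder, CheckedUuidHolder, CheckedOtherHolder and CheckedUuidWriter are not Arrow classes). The choice of IllegalArgumentException is an assumption; the review does not name an exception type:

```java
// Hypothetical stand-ins for a holder hierarchy and a writer.
abstract class CheckedHolder {}

class CheckedUuidHolder extends CheckedHolder {
  byte[] value = new byte[16];
}

class CheckedOtherHolder extends CheckedHolder {}

class CheckedUuidWriter {
  byte[] last; // most recently written value, kept only to make the sketch observable

  void write(CheckedHolder holder) {
    // Fail loudly instead of silently ignoring an unexpected holder.
    if (!(holder instanceof CheckedUuidHolder)) {
      String actual = holder == null ? "null" : holder.getClass().getSimpleName();
      throw new IllegalArgumentException("Expected CheckedUuidHolder but got " + actual);
    }
    last = ((CheckedUuidHolder) holder).value;
  }
}
```

Compared with an instanceof guard that has no else branch, this surfaces a mismatched holder at the call site rather than dropping the write silently.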

import org.apache.arrow.vector.holder.UuidHolder;
import org.apache.arrow.vector.holders.ExtensionHolder;

public class UuidWriterImpl extends AbstractExtensionTypeWriter {
Member

Suggested change
- public class UuidWriterImpl extends AbstractExtensionTypeWriter {
+ public class UuidWriterImpl extends AbstractExtensionTypeWriter<UuidVector> {
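
A sketch of what parameterizing the base writer buys, using hypothetical stand-ins (TypedVectorBase, TypedUuidVector, TypedAbstractWriter and TypedUuidWriter are written for illustration and are not the Arrow classes). Once the base class carries the vector type, the subclass sees a correctly typed field and needs no casts:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for a vector hierarchy.
class TypedVectorBase {
  int valueCount;

  void setValueCount(int count) {
    valueCount = count;
  }
}

class TypedUuidVector extends TypedVectorBase {
  final Map<Integer, byte[]> slots = new HashMap<>();

  void setSafe(int index, byte[] value) {
    slots.put(index, value);
  }
}

// The base writer is generic in the vector type it wraps.
abstract class TypedAbstractWriter<V extends TypedVectorBase> {
  protected final V vector;
  protected int position;

  TypedAbstractWriter(V vector) {
    this.vector = vector;
  }
}

class TypedUuidWriter extends TypedAbstractWriter<TypedUuidVector> {
  TypedUuidWriter(TypedUuidVector vector) {
    super(vector);
  }

  void write(byte[] value) {
    // 'vector' is already a TypedUuidVector here, so setSafe resolves without a cast.
    vector.setSafe(position, value);
    vector.setValueCount(position + 1);
  }
}
```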

ByteBuffer bb = ByteBuffer.allocate(16);
bb.putLong(uuid.getMostSignificantBits());
bb.putLong(uuid.getLeastSignificantBits());
((UuidVector) this.vector).setSafe(this.idx(), bb.array());
Member

Suggested change
- ((UuidVector) this.vector).setSafe(this.idx(), bb.array());
+ vector.setSafe(idx(), bb.array());
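
For reference, the ByteBuffer lines in the snippet under review pack a UUID into 16 bytes, most significant long first. A standalone sketch of that encoding and its inverse, using only the JDK (the class name UuidBytes and its method names are made up for this example):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

class UuidBytes {
  // Encodes a UUID as 16 bytes: most significant bits, then least significant bits.
  static byte[] toBytes(UUID uuid) {
    ByteBuffer bb = ByteBuffer.allocate(16); // big-endian by default
    bb.putLong(uuid.getMostSignificantBits());
    bb.putLong(uuid.getLeastSignificantBits());
    return bb.array();
  }

  // Decodes the 16-byte form back into a UUID.
  static UUID fromBytes(byte[] bytes) {
    ByteBuffer bb = ByteBuffer.wrap(bytes);
    long mostSignificant = bb.getLong();
    long leastSignificant = bb.getLong();
    return new UUID(mostSignificant, leastSignificant);
  }
}
```

The 16-byte array is what a fixed-size binary storage of width 16 would hold for each value.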

Comment on lines 45 to 46
((UuidVector) this.vector).setSafe(this.idx(), uuidHolder.value);
this.vector.setValueCount(this.idx() + 1);
Member

Suggested change
- ((UuidVector) this.vector).setSafe(this.idx(), uuidHolder.value);
- this.vector.setValueCount(this.idx() + 1);
+ vector.setSafe(idx(), uuidHolder.value);
+ vector.setValueCount(idx() + 1);

*
* @param value the extension type value to write
*/
void writeExtensionType(Object value);
Member

This is a nit but now I'm thinking it should be just writeExtension to parallel writeVarChar

*
* @param factory the extension type factory to add
*/
void addExtensionTypeFactory(ExtensionTypeWriterFactory factory);
Member

Suggested change
- void addExtensionTypeFactory(ExtensionTypeWriterFactory factory);
+ void addExtensionTypeWriterFactory(ExtensionTypeWriterFactory factory);

for consistency as well


@Override
protected int idx() {
return super.idx();
Member

Hmm, is the only reason we override here to make it protected instead of package-private? (Should we just use getPosition() instead as it is already public and does the same thing?)

Contributor Author

yep, makes sense - logic with idx() method added for all other writers.

@xxlaykxx xxlaykxx requested a review from lidavidm April 9, 2025 06:43
Member

@lidavidm lidavidm left a comment

LGTM, but please see the pre-commit failures

@lidavidm lidavidm merged commit 437612c into apache:main Apr 9, 2025
25 of 26 checks passed
Member

lidavidm commented Apr 9, 2025

Thanks @xxlaykxx @FiV0!

xxlaykxx added a commit to dremio/arrow-java that referenced this pull request Apr 23, 2025
xxlaykxx added a commit to dremio/arrow-java that referenced this pull request Apr 29, 2025
lriggs pushed a commit to lriggs/arrow-java that referenced this pull request Jul 14, 2025
lriggs pushed a commit to lriggs/arrow-java that referenced this pull request Jul 14, 2025
* apacheGH-87: [Vector] Add ExtensionWriter (apache#697)
missed file

* apacheGH-87: [Vector] Add ExtensionWriter (apache#697)
missed file

* apacheGH-87: [Vector] Add ExtensionWriter (apache#697)
updated UnionListWriter.
Successfully merging this pull request may close these issues.

[Java] StructVector throws with ExtensionType
2 participants