WIP: Arrow reader/writer for geometry and geography #8801
Conversation
parquet/src/arrow/schema/mod.rs
Outdated
```rust
#[cfg(feature = "geospatial")]
parquet_geospatial::crs::parquet_to_arrow(arrow_schema, &metadata)
```
The primary question I have is related to these lines here. I started off trying to solve the most complex case of geospatial CRS metadata noted by @paleolimbot in this comment:
The issue is when there's an Arrow metadata key of projjson:<key> the actual projjson CRS data exists in the Parquet KeyValue metadata. When reading parquet, this seems to be the last place in the schema conversion chain that metadata is still available.
If we want to avoid breaking the existing public API for building extension types I think converting from the parquet CRS to an Arrow CRS has to happen here since we need access to metadata. An alternate approach would be passing the parquet metadata further down the stack so it's available for users to reference. At the current time it looks like only GEOMETRY and GEOGRAPHY would be consumers of that, but it could be a more future-proof solution if we'd prefer to take that route. (edit) I also considered trying to place the metadata on the hint Field, but each field requires an owned copy of the metadata and I didn't think cloning the metadata once for each field in the schema seemed desirable.
If my review of the PRs was correct, I believe at least some of this tooling is relatively new for the Variant type support from @alamb, so any guidance on a preferred way to handle this case would be appreciated!
The issue is when there's an Arrow metadata key of projjson: the actual projjson CRS data exists in the Parquet KeyValue metadata. When reading parquet, this seems to be the last place in the schema conversion chain that metadata is still available.
For what it's worth, there is no writer that can actually write this yet (for a similar reason: no Parquet implementation has access to a mutable key value metadata when converting types on the write end) and if you'd like to punt on this, you can pass the "crs": "projjson:<key>" as the GeoArrow CRS to allow a caller at a higher level to figure this out.
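The "pass it on for a higher level to figure out" approach suggested here could be sketched roughly as follows. This is a hypothetical helper, not the PR's actual API: the name `resolve_crs` and the shape of the metadata lookup are illustrative, assuming the Parquet key/value metadata is available as a `HashMap`.

```rust
use std::collections::HashMap;

/// Resolve a Parquet CRS string for GeoArrow consumers: a "projjson:<key>"
/// reference is looked up in the file's key/value metadata when possible;
/// all other values (and unresolved keys) are passed through unchanged so
/// a higher-level caller can interpret them.
fn resolve_crs(crs: &str, key_value_metadata: &HashMap<String, String>) -> String {
    if let Some(key) = crs.strip_prefix("projjson:") {
        if let Some(projjson) = key_value_metadata.get(key) {
            return projjson.clone();
        }
        // Key not present: fall through and pass the reference on as-is.
    }
    crs.to_string()
}

fn main() {
    let mut kv = HashMap::new();
    kv.insert("my_crs".to_string(), r#"{"type":"GeographicCRS"}"#.to_string());
    assert_eq!(resolve_crs("projjson:my_crs", &kv), r#"{"type":"GeographicCRS"}"#);
    assert_eq!(resolve_crs("srid:4326", &kv), "srid:4326");
    assert_eq!(resolve_crs("projjson:missing", &kv), "projjson:missing");
}
```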
That's a fair point, and I would be willing to set this aside for the time being if we don't want to pursue a solution right away. The solution I have presented here is, admittedly, a bit of a hack. It doesn't feel like a huge hack since we're already in the process of converting schema elements etc. but I also wouldn't categorize the solution as "elegant."
I'm not qualified to comment on the elegantness/level of hack (it doesn't look out of place to me, but I'm new here!). In the Arrow C++ implementation I had to modify the signatures of a few functions to pass the key/value metadata reference down into the type conversions. In C++ we have overloads so it wasn't a breaking change to modify the signatures and nobody objected. If it's reasonable to support it would be great to make it happen while we're all here looking at this.
Yeah, here I think it would likely be less of a hack if we pushed the parquet key-value metadata further down the stack by passing metadata: Option<HashMap<String, String>>. It changes the signature of public methods, but would be an easy migration for users of these low-level methods.
parquet-geospatial/src/crs.rs
Outdated
```rust
if let Some(ext_name) = field.extension_type_name()
    && ext_name == "geoarrow.wkb"
    && let Some(ext_meta) = field.metadata_mut().get_mut("ARROW:extension:metadata")
```
This is a much less important question, but this let chain seems to have broken rustfmt for this file. let chains require 2024 edition, but I looked at the rustfmt.toml for the package and it seems to set the edition to 2024, so if anyone has thoughts as to why this breaks I'd love to have rustfmt happy again.
paleolimbot
left a comment
Thank you for working on this! In addition to the fact that I'm excited to use this, it is valuable to have somebody else's perspective on the spec.
I left some comments but I know you're still working on this, so feel free to ignore them if you had a plan 🙂
```rust
match field.extension_type_name() {
    Some(n) if n == GeometryType::NAME => match field.try_extension_type::<GeometryType>() {
        Ok(GeometryType) => Some(LogicalType::Geometry { crs: todo!() }),
        Err(_e) => None,
    },
    Some(n) if n == GeographyType::NAME => match field.try_extension_type::<GeographyType>() {
```
As I'm sure you've figured out, there's an awkward situation here where we have one ::NAME (geoarrow.wkb) and two Parquet logical types. You can figure this out from parsing the "edges" field of the JSON in the "ARROW:extension:metadata" metadata key (if it's not present or it's "planar", it's Geometry; otherwise the algorithm name can be mapped to the enum in the Parquet crate).
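The disambiguation described here could be sketched as a small mapping. The enum and function names below are placeholders for illustration, not the parquet crate's actual types:

```rust
/// Both Parquet logical types share the single "geoarrow.wkb" extension
/// name, so the "edges" field of the extension metadata decides which one
/// applies: absent or "planar" means GEOMETRY, anything else names a
/// GEOGRAPHY edge-interpolation algorithm.
#[derive(Debug, PartialEq)]
enum WkbLogicalType {
    Geometry,
    Geography { algorithm: String },
}

fn logical_type_from_edges(edges: Option<&str>) -> WkbLogicalType {
    match edges {
        None | Some("planar") => WkbLogicalType::Geometry,
        Some(algorithm) => WkbLogicalType::Geography {
            algorithm: algorithm.to_string(),
        },
    }
}

fn main() {
    assert_eq!(logical_type_from_edges(None), WkbLogicalType::Geometry);
    assert_eq!(logical_type_from_edges(Some("planar")), WkbLogicalType::Geometry);
    assert_eq!(
        logical_type_from_edges(Some("spherical")),
        WkbLogicalType::Geography { algorithm: "spherical".to_string() }
    );
}
```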
Yes, I've been blissfully ignoring this fact for the time being. Thank you for pointing out that planar equates to Geometry though, I didn't have that on my radar and probably would have missed it!
```rust
#[cfg(not(feature = "geospatial"))]
pub(crate) fn logical_type_for_binary_view(field: &Field) -> Option<LogicalType> {
    None
}
```
I'm not sure anybody is using large_binary on purpose, but if it's easy to do it may be worth supporting that here.
parquet-geospatial/src/types.rs
Outdated
```rust
pub struct Geometry {
    crs: Crs,
}

impl Geometry {
```
It may help to have just one struct Wkb { crs: Crs, edges: String } that implements ExtensionType. Then the try_extension_type::<Wkb>() below will parse the edges for us and let us choose geometry or geography for the Parquet type.
Yes, I agree. I just kinda shoved this in here and stubbed it out so I could better understand how the code flow of this crate works since I'm still learning my way around. My plan for this was to more or less adopt @kylebarron's WkbType.
That actually brings another good question, which is if we move the WkbType here, would the GeoArrow-rs family of crates end up using this version because it's upstream?
It's certainly a possibility... it's also a problem with a much larger scope than converting metadata. I won't speak for Kyle but I think the GeoArrow crate would possibly benefit from evolving at its own rate rather than being tied to the arrow-rs release cadence.
parquet-geospatial/src/crs.rs
Outdated
```rust
    Self::Projjson(serde_json::from_str(proj_meta).unwrap())
} else if let Some(srid) = crs.strip_prefix("srid:") {
    Self::Srid(srid.parse().unwrap())
} else {
```
An important case to handle here is the one where the crs can be parsed as JSON (Self::Projjson(...)), which is how most CRSes will arrive here with current implementations.
Just to make sure I'm understanding this case correctly. The incoming metadata would look like:
{"crs": "<PROJJSON string>"} (omitting the projjson: prefix)
It does seem convenient to try to parse the above case as JSON, and try to detect whether it looks like a PROJJSON object. Since we don't have access to an actual PROJ parser here I would be a little bit concerned about the potential for false positives or potentially malformed PROJ objects. I don't think I feel super strongly about those concerns though. It seems like it's one of those trade-offs between user convenience vs guaranteed correctness.
Just to make sure I'm understanding this case correctly. The incoming metadata would look like:
{"crs": "<PROJJSON string>"}
From Parquet, the actual crs payload is most commonly Some("<PROJJSON string>"). This is because there is no way to reasonably implement dumping the string into the metadata and writing Some("projjson:<key>") in Arrow C++, and because most CRSes arrive from GeoArrow as PROJJSON.
From GeoArrow, it's possible to get {"crs": "<PROJJSON string>"} (at least one version of GeoPandas), {"crs": "AUTH:CODE"}, or {"crs": {<JSON object of PROJJSON>}} (most common).
I would be a little bit concerned about the potential for false positives or potentially malformed PROJ objects
For the specific case of converting Parquet CRSes to GeoArrow ones, I don't think this matters. It's this implementation's job to pass on as much CRS information to GeoArrow as accurately as it can (and vice versa). If you try to parse JSON and it fails, you can (and I would argue you should) just pass on the string in either direction.
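The "try to parse, and pass the string on either way" behaviour could be sketched like this. A real implementation would attempt `serde_json::from_str` and keep the parsed value on success; to stay dependency-free, this hypothetical sketch only approximates the JSON check with a brace test:

```rust
/// Classify an incoming CRS string: if it looks like a JSON object, treat
/// it as PROJJSON; otherwise pass the raw string through unchanged. A real
/// implementation would use serde_json::from_str instead of this brace test.
fn classify_crs(crs: &str) -> (&'static str, String) {
    let trimmed = crs.trim();
    if trimmed.starts_with('{') && trimmed.ends_with('}') {
        ("projjson", trimmed.to_string())
    } else {
        // e.g. "AUTH:CODE" or any other opaque value: forward it unchanged.
        ("string", crs.to_string())
    }
}

fn main() {
    assert_eq!(classify_crs(r#"{"type":"GeographicCRS"}"#).0, "projjson");
    assert_eq!(classify_crs("OGC:CRS84").0, "string");
    assert_eq!(classify_crs("OGC:CRS84").1, "OGC:CRS84");
}
```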
parquet-geospatial/src/crs.rs
Outdated
```rust
#[derive(Debug)]
pub enum Crs {
    Projjson(serde_json::Value),
    Srid(u64),
    Other(String),
}
```
It may help to clarify if this is intended to be a GeoArrow CRS or a Parquet CRS (it seems like this is intended to match the Parquet CRS model)
Yes, this is following the Parquet CRS. I figured that, since it's in the parquet-geospatial sub-crate, it would be understood that the spec being followed is Parquet's. I'm glad you noted this, because it seems my assumption was incorrect. 😄
Based on our previous conversation here #8222 (comment) I thought using the Parquet CRS seemed like the safest option for users. Since the GeoArrow CRS allows a truly unspecified/unknown option and Parquet does not, it's not always safe to serialize GeoArrow as Parquet.
As a user, I think I would personally like to know and handle that case rather than have some unspecified GeoArrow be serialized and blindly interpreted as OGC:CRS84. This is, of course, just a personal preference. I know there are plenty of real-world users who would probably prefer a "best effort" approach and would like to just deal with incorrect metadata later. All that being said, I would be willing to adopt a more lenient approach if we think that's a better course of action.
parquet-geospatial/src/crs.rs
Outdated
```rust
match de["crs_type"].as_str() {
    Some("projjson") => Crs::Projjson(crs.clone()),
    Some("srid") => Crs::Srid(crs.as_number().unwrap().as_u64().unwrap()),
    _ => Crs::Other(crs.to_string())
```
Because the crs_type is optional (it was more recently added to the GeoArrow spec), it's common for the crs to be PROJJSON without a crs_type letting us know that's the case. There is also a fun case where GeoPandas exported JSON as an escaped JSON string for a few versions (it's been fixed now, but it was fresh when I did the Arrow C++ implementation, so I handle it there).
parquet-geospatial/src/crs.rs
Outdated
```rust
let Some(crs) = de["crs"].as_str() else {
    return Crs::Srid(4326);
};
```
This would be clearer as Crs::LonLat (neither GeoArrow nor Parquet specifications guarantee that an SRID is an EPSG code).
Can you help me understand this a bit better?
This code was an attempt to solve the case where a parquet file omits the CRS metadata. The spec states:
The default CRS OGC:CRS84 means that the geospatial features must be stored in the order of longitude/latitude based on the WGS84 datum.
...
For geographic CRS, longitudes are bound by [-180, 180] and latitudes are bound by [-90, 90].
Doesn't this functionally bind an unspecified CRS to SRID:4326? SRID:3857 also uses the WGS84 datum, but it doesn't satisfy the lon/lat bounding conditions.
If you could guarantee that an "srid" value in either Parquet or GeoArrow referred to an EPSG identifier, it would be safe to make that assumption; however, neither make any guarantees about SRIDs and so writing a GeoArrow CRS as {"crs": "4326", "crs_type": "srid"} actually won't get interpreted as lon/lat in most cases, and there is no guarantee that a Parquet file with a crs of srid:4326 will be interpreted correctly (although it probably will be).
The canonical way to write "lon/lat" crses to Parquet is None (i.e., not present); the canonical way to write "lon/lat" to GeoArrow is either {"crs": "OGC:CRS84", "crs_type": "authority_code"} or the 4 KB PROJJSON string with the full definition (although there are a few other ways it could possibly show up in GeoArrow):
Arrow C++ tries as hard as it possibly can to detect a GeoArrow CRS that is lon/lat to ensure that the Parquet representation is None, and converts a None Parquet crs to GeoArrow metadata of {"crs": "OGC:CRS84", "crs_type": "authority_code"} because I didn't feel like explaining to non-spatial reviewers why we needed 4 KB of JSON to solve that case.
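The canonical lon/lat handling described above could be sketched as a pair of conversions. The function names are illustrative only, and the sketch deliberately omits the harder detection Arrow C++ does (recognizing the full PROJJSON spelling of lon/lat):

```rust
/// A missing Parquet CRS canonically means lon/lat, which maps to GeoArrow
/// metadata of {"crs": "OGC:CRS84", "crs_type": "authority_code"}; any
/// other value is passed through with crs_type left unset for later
/// detection.
fn parquet_crs_to_geoarrow(crs: Option<&str>) -> (String, Option<String>) {
    match crs {
        None => ("OGC:CRS84".to_string(), Some("authority_code".to_string())),
        Some(s) => (s.to_string(), None),
    }
}

/// The canonical way to write lon/lat to Parquet is to omit the CRS
/// entirely. A thorough implementation would also detect the equivalent
/// PROJJSON definition; this sketch only checks the authority-code form.
fn geoarrow_crs_to_parquet(crs: &str) -> Option<String> {
    if crs == "OGC:CRS84" {
        None
    } else {
        Some(crs.to_string())
    }
}

fn main() {
    assert_eq!(geoarrow_crs_to_parquet("OGC:CRS84"), None);
    assert_eq!(geoarrow_crs_to_parquet("srid:3857"), Some("srid:3857".to_string()));
    let (crs, crs_type) = parquet_crs_to_geoarrow(None);
    assert_eq!((crs.as_str(), crs_type.as_deref()), ("OGC:CRS84", Some("authority_code")));
}
```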
Thanks, this was helpful, especially the insight into the GeoArrow ecosystem. I'm much more familiar with the GeoParquet ecosystem where you either assume EPSG:4326 or you parse PROJJSON. I'm going to respond more thoroughly to another comment further down because I think this all dovetails together in the overall CRS metadata type definition.
because I didn't feel like explaining to non-spatial reviewers why we needed 4 KB of JSON to solve that case.
This made me chuckle 😂
parquet-geospatial/src/crs.rs
Outdated
```rust
#[derive(Debug)]
pub enum Crs {
```
I wonder if struct Crs { value: String, crs_type: Option<String> } might help unify this between Parquet and GeoArrow. It's hard to abstract Crses and I'm not sure the enum here has made the conversion code more clear.
I agree your proposed struct definition would better unify the parquet and arrow CRS definitions. I believe this is exactly what @kylebarron has in geoarrow-rs as well. Using that CRS struct definition could also get us some "free" serialization/deserialization if we leverage serde traits, which could be nice.
While I agree the conversion code isn't any more clear (perhaps the opposite, really), using the enum does buy us some additional safety. Users of the enum get compile-time checks that they've handled all possible cases, and it's "impossible" (outside of a glaring bug) to accidentally have a mismatch between the crs_type and the actual crs data. We also get some additional strong typing of the crs data, which I personally prefer. Again, this is mostly my preferred way of working with data, so if we end up thinking a less rigorous approach is more broadly appealing I'm happy to change course!
We also get some additional strong typing of the crs data, which I personally prefer.
If you can guarantee this, it's great! I think my point was that the current implementation (which I know is in the works) doesn't quite do that (e.g., there are some common cases that currently go through Other(String) where the string is actually PROJJSON), so perhaps it would help to reduce the scope of the Crs to make it simpler.
I think my point was that the current implementation (which I know is in the works) doesn't quite do that (e.g., there are some common cases that currently go through Other(String) where the string is actually PROJJSON), so perhaps it would help to reduce the scope of the Crs to make it simpler.
Yes, I think this is a fair point, especially considering your more detailed description of the landscape around CRS metadata above. What I've gathered from this discussion at large is that the Crs type here is perhaps over-constrained. My general thought with the type written here was that in cases where we can very confidently say it's PROJJSON or an SRID we can present it as such, and the Other(String) cases would need to be handled case-by-case at a higher level.
There's probably quite a bit of wisdom in keeping the underlying type simple, and just presenting the CRS as it exists in the file and putting the onus on the user to do with it what they will.
@paleolimbot thank you so much for a thorough and thoughtful review so early on in the development of this! I honestly didn't expect anyone to trudge through all this code in its very very rough state. Getting some of these questions/comments ironed out earlier rather than later will make my life easier when actual merge-ready reviews come up, so I genuinely appreciate you taking the time to have looked at this already.

@paleolimbot After your initial review and taking a look at the conversations on the initial rough implementation of this by @kylebarron I've made some updates and would really appreciate if everyone can take a look. I get the sense that rather than a smarter parser for the metadata we're more looking for this library to act as a spec compliant "data tube" and instead let users handle working with the metadata that's in the file however they wish (including validity, handling missing metadata etc.). Any thoughts are very much appreciated!
parquet-geospatial/src/crs.rs
Outdated
```rust
impl Crs {
    // TODO: make fallible
    fn try_from_parquet_str(crs: &str, metadata: &HashMap<String, String>) -> Self {
```
This method has functionally changed to only detect the case where the metadata is in the form of {"crs": "projjson:<metadata key>"} and pass through all other cases as their raw string value(s) to be handled at a higher level. This CRS format is the only one where the CRS metadata isn't fully contained in the raw string value, so some minor intervention is still necessary.
paleolimbot
left a comment
A few small notes, but I think in general this is looking good!
we're more looking for this library to act as a spec compliant "data tube"
I think so!
parquet-geospatial/src/crs.rs
Outdated
```rust
pub(super) fn to_arrow_string(&self) -> String {
    serde_json::to_string(&self).unwrap_or_default()
}
```
To avoid output like {"crs": "{\"type\": \"GeographicCRS\", ...", "crs_type": "projjson"} (i.e., JSON escaped as a JSON string), I think it might be best to attempt parsing self.crs (and if it succeeds, return the corresponding Value). This will also help with roundtripping, as the metadata probably arrived as {"crs": {"type": "GeographicCRS", ...}, "crs_type": "projjson"}. This may solve the roundtrip testing issue if there was one.
Because "edges" will also need to get added here, maybe this could return (Option<Value>, Option<Value>)?
Pending some decisions related to my comment here: #8801 (comment) I get the sense this method may just disappear entirely. Right now it's mostly poorly named since we aren't really converting things to "arrow" anymore anyway.
That being said, I will keep an eye out for a good place to check the incoming CRS value for JSON to avoid the double escaped string issues!
parquet/src/arrow/schema/mod.rs
Outdated
```rust
#[cfg(feature = "geospatial")]
parquet_geospatial::crs::replace_keyvalue(arrow_schema, &metadata)
```
I'm not sure we need to modify the embedded schema here, which is probably more accurate (i.e., already contains the CRS in the form the user intended)...was this to help with roundtrip testing?
This is just a name change related to our previous discussion here:
As the public API currently stands we have to do something to intervene for the case where there's a key-value CRS if we want that CRS metadata associated with the resulting WkbArray. Since no writers currently support writing this format perhaps it's moot and I think it could be pretty easily justified that this should just be removed. On the surface it seems valuable to me to at least be able to read this case and provide the underlying CRS meta, but if it ends up causing confusion because the output CRS isn't the same as the potential input that could be a pretty good justification for removing this.
I think we're on the same page...I may be misunderstanding how this might get invoked, but I think:
- A user wants to write arrow data with a given schema, and that arrow schema contains a column with GeoArrow metadata
- The existing implementation serializes that schema, including the GeoArrow metadata on the column and puts it in this key/value key
- Independently, the Parquet schema converts the Parquet logical types for readers that can't (or have been requested not to) inspect the embedded Arrow schema.
- Readers that read the file from Arrow get the type exactly as it was written
- Readers that read the file from elsewhere still have correct CRS information as it was translated to Parquet
Is there another situation I haven't considered?
Yes, I also think we're on the same page. I was thinking of a situation where this code would perform some manner of validation on the CRS metadata, regardless of whether it was communicated via Arrow or the Parquet key/value metadata. Since we're moving away from validation I think I'm convinced this code to pull that PROJJSON from the Parquet key/value metadata is no longer necessary.
- Removes some unused code
- Shows current functionality with basic tests for reading and writing both geometry and geography
```rust
#[serde(skip_serializing_if = "Option::is_none")]
type_hint: Option<Hint>,
}
```
I want to highlight this change because I think it merits discussion. I've introduced a type_hint into the metadata for this extension because I believe it's necessary to solve a round-trip for a specific ambiguous case.
If my reading of the spec is correct, having a parquet GEOGRAPHY logical type with no metadata is a valid case (the default values of crs: OGC:CRS84 and algorithm: SPHERICAL would apply for readers of the column). If we want to support writing this case while preserving the (lack of) input metadata we need some additional information to inform us of the underlying type of the array data.
This field is obviously not a part of the parquet metadata spec, nor the GeoArrow metadata spec. However, the WkbArray that has been introduced here doesn't claim to be either, so perhaps this is ok. If we want to ensure the metadata here strictly follows one spec or the other we will likely have to drop back to determining the parquet type of the array heuristically based on the algorithm field which could result in incorrect logical types for the ambiguous case. While I suspect this case is relatively rare, I think it's important to note it could happen.
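The ambiguous case above could be illustrated with a sketch, where a hypothetical `type_hint` (standing in for the field proposed here; all names are illustrative) overrides the heuristic that would otherwise misclassify a metadata-free GEOGRAPHY column:

```rust
/// Decide the Parquet logical type for a WKB column. A preserved type hint
/// wins; otherwise fall back to the "edges" heuristic, which cannot tell a
/// default GEOMETRY from a default GEOGRAPHY when all metadata is absent.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ParquetGeoType {
    Geometry,
    Geography,
}

fn parquet_type(edges: Option<&str>, type_hint: Option<ParquetGeoType>) -> ParquetGeoType {
    if let Some(hint) = type_hint {
        return hint; // unambiguous: carried over from the original file
    }
    match edges {
        // Heuristic: no edges (or planar edges) reads as GEOMETRY, so a
        // default GEOGRAPHY column with omitted metadata is misclassified.
        None | Some("planar") => ParquetGeoType::Geometry,
        Some(_) => ParquetGeoType::Geography,
    }
}

fn main() {
    assert_eq!(parquet_type(None, None), ParquetGeoType::Geometry);
    assert_eq!(parquet_type(None, Some(ParquetGeoType::Geography)), ParquetGeoType::Geography);
    assert_eq!(parquet_type(Some("spherical"), None), ParquetGeoType::Geography);
}
```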
If we want to support writing this case while preserving the (lack of) input metadata
You're absolutely correct to spot this case! I am not personally worried about this one...I think you're safe to canonically export spherical edges as omitted if you'd like (this is what Arrow C++ will do today). Also if you're passionate about preserving this particular round trip, go for it!
I may have missed a case, although I think that all of the metadata you're proposing here is correct for both the Parquet and GeoArrow specs (there are just multiple ways to represent some things and it's often useful to roundtrip things where possible?).
Er, I suppose putting a "type_hint" key in GeoArrow extension metadata is a bit fishy although I don't think it will cause any existing parsers to error.
Right, the type_hint (regardless of whether or not this would be tolerated by existing parsers) more or less makes this metadata neither Parquet nor GeoArrow. I think the more philosophical question of "should this metadata be Parquet or GeoArrow" is important here. I'll preface this by saying I don't really feel too strongly about this and my primary goal is to make sure this is usable for the community at large, however, I'll at least detail my thought process behind it.
I think even without the type_hint the metadata we present here is already not really GeoArrow metadata, in spite of it generally looking compatible. Since cases like {"crs": "projjson:my_crs_metadata_key"} don't actually communicate the CRS without user (or other library, likely geoarrow-rs) intervention, there's already inherently some additional inspection of this metadata necessary to convert it to GeoArrow. If users of the metadata are going to need to inspect and potentially convert it anyway, we may as well allow the metadata to support all the cases the underlying parquet spec supports.
Now, all that being said, I think we're in agreement that the case this supports isn't a huge concern. If it makes most sense from an ecosystem compatibility perspective to discard the type_hint and accept that we technically have an unsupported case I think that's likely the strongest justification for how to move forward here.
metadata we present here is already not really GeoArrow metadata
I do think it's important to output valid GeoArrow metadata, at least in the sense that it conforms to the spec and tries its best to pass on any information it has (arbitrary strings in GeoArrow are permitted for this purpose!). (This Metadata struct can definitely represent anything we'd like it to...I'm talking strictly about the JSON output we write to the output Field).
accept that we technically have an unsupported case
I see this more of an inability to roundtrip one of two possible representations of spherical edges from Parquet, through GeoArrow, then back to Parquet. I'm comfortable with that but let me know if there's something beyond this I didn't see!
```rust
pub struct WkbArray {
    inner: ArrayRef,
    metadata: Metadata,
}

impl WkbArray {
```
This WkbArray is a new addition in this most recent commit.
@alamb do you have some time to quickly look over this new type (as it's still WIP) and provide any feedback regarding the general API? I've tried to mimic the work done for the VariantArray as closely as possible.
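Since the question above is about the general shape of the API, here is a deliberately minimal, dependency-free sketch of what an extension-array wrapper in the VariantArray style looks like: a physical array paired with parsed extension metadata, with a validating `try_new` constructor. The real type wraps an arrow `ArrayRef`; a `Vec` of WKB buffers stands in here so the example is self-contained, and all names are illustrative assumptions rather than the PR's API.

```rust
// Hypothetical, simplified stand-in for the extension metadata.
#[derive(Debug, Clone, PartialEq)]
struct Metadata {
    crs: Option<String>,
}

// Sketch of the wrapper: one WKB-encoded geometry per row (None = null),
// plus the metadata that applies to the whole array.
struct WkbArraySketch {
    inner: Vec<Option<Vec<u8>>>,
    metadata: Metadata,
}

impl WkbArraySketch {
    // Mirrors the `try_new`-style constructor pattern used by VariantArray:
    // validate the input, then store it alongside the metadata.
    fn try_new(inner: Vec<Option<Vec<u8>>>, metadata: Metadata) -> Result<Self, String> {
        // A real implementation would validate the storage type
        // (Binary/LargeBinary/BinaryView); here we only reject empty payloads.
        if inner.iter().flatten().any(|wkb| wkb.is_empty()) {
            return Err("WKB values must be non-empty".to_string());
        }
        Ok(Self { inner, metadata })
    }

    fn len(&self) -> usize {
        self.inner.len()
    }

    fn metadata(&self) -> &Metadata {
        &self.metadata
    }
}

fn main() {
    let wkb_bytes = vec![0x01, 0x01, 0x00, 0x00, 0x00]; // illustrative bytes, not a full WKB value
    let arr =
        WkbArraySketch::try_new(vec![Some(wkb_bytes), None], Metadata { crs: None }).unwrap();
    assert_eq!(arr.len(), 2);
    assert!(WkbArraySketch::try_new(vec![Some(vec![])], Metadata { crs: None }).is_err());
}
```

The design point carried over from VariantArray is that construction is fallible: invalid storage is rejected up front so accessors can stay infallible.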
(It seems reasonable from my end although I don't really know how the extension array/extension type/metadata is supposed to work from the arrow-rs end of things and Andrew is definitely the right person to answer that)
```rust
#[cfg(test)]
mod tests {
```
As of now, all of these tests are passing, which I believe indicates this code (while still in need of cleanup, proper errors, more tests, docs etc. etc.) is generally functioning as expected 🎉
Is it worth putting these tests in parquet/tests/geospatial.rs? There might be some helper functions there for some of this.
Yeah, part of the "WIP" nature of this as it stands right now is I've put little to no effort into organizing these tests. I just needed something so I could iterate and it was easy to copy/paste from the Variant tests and put them here 😄 I'm not very familiar with this crate's organization, so if parquet/tests/geospatial.rs is the right spot I'm happy to move them there and deduplicate any helpers as this work gets closer to being merge-ready.
@paleolimbot I've further refined this implementation and introduced the actual array type in this most recent push. If you could find some time for another review and provide any thoughts I'd greatly appreciate it. I've highlighted a couple of key areas with comments to hopefully make it a bit easier for reviewers; any feedback outside of those areas is obviously welcome as well. Given that the tests to both read and write both of the Parquet types are passing, I believe the functional components of this code are getting close, though there's obviously plenty of additional validation, tests, docs etc. that still need to be written. The size of this PR is already getting a bit large, so if anyone has input on how we might want to split the work into more easily reviewable parts to actually start pushing it through (once we finish getting consensus on this WIP), I'd be happy to consider that when converting this to "real" PRs.
paleolimbot
left a comment
I know you're still working to clean up/complete a few things but I think conceptually this is very close! I left a comment inline but I think removing the type_hint would be best unless there's something I'm missing.
> The size of this PR is already getting a bit large
I think this is a great scope/size for me so far...no worries! The WkbArray seems orthogonal to this work but if it helps you write nice tests go for it!
```rust
    #[serde(skip_serializing_if = "Option::is_none")]
    type_hint: Option<Hint>,
}
```
> If we want to support writing this case while preserving the (lack of) input metadata
You're absolutely correct to spot this case! I am not personally worried about this one...I think you're safe to canonically export spherical edges as omitted if you'd like (this is what Arrow C++ will do today). Also if you're passionate about preserving this particular round trip, go for it!
I may have missed a case, although I think that all of the metadata you're proposing here is correct for both the Parquet and GeoArrow specs (there are just multiple ways to represent some things and it's often useful to roundtrip things where possible?).
Er, I suppose putting a "type_hint" key in GeoArrow extension metadata is a bit fishy although I don't think it will cause any existing parsers to error.
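The effect of the `skip_serializing_if = "Option::is_none"` attribute shown in the diff above can be illustrated without serde: when the optional hint is `None`, the key is simply absent from the extension metadata JSON, so existing GeoArrow parsers never see the extra field. The JSON is hand-rolled here only to keep the sketch dependency-free; the key names mirror the diff but the function itself is a hypothetical illustration.

```rust
// Build the extension metadata JSON, omitting absent optional fields —
// the same observable behaviour serde's skip_serializing_if produces.
fn metadata_json(crs: Option<&str>, type_hint: Option<&str>) -> String {
    let mut fields = Vec::new();
    if let Some(crs) = crs {
        fields.push(format!(r#""crs":"{crs}""#));
    }
    if let Some(hint) = type_hint {
        // Only emitted when present, matching skip_serializing_if.
        fields.push(format!(r#""type_hint":"{hint}""#));
    }
    format!("{{{}}}", fields.join(","))
}

fn main() {
    // Canonical export: no type_hint key at all.
    assert_eq!(
        metadata_json(Some("OGC:CRS84"), None),
        r#"{"crs":"OGC:CRS84"}"#
    );
    // With a hint, the extra (arguably fishy) key appears.
    assert_eq!(
        metadata_json(Some("OGC:CRS84"), Some("geography")),
        r#"{"crs":"OGC:CRS84","type_hint":"geography"}"#
    );
}
```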
@paleolimbot As always, thank you for your thoughts!
I left a quick response detailing my current thoughts around this. As stated there, I don't really feel too strongly about this, and if my reasoning doesn't sway you I'm happy to remove the `type_hint`.
Yes, I think it's an easy candidate to carve off into a separate PR. What will likely happen is that, after we've reached a good consensus on this WIP, I will open separate "merge ready" PRs that take the code from this WIP and clean up names, error propagation, unit tests, doc strings etc. That way the scope of an individual PR is smaller and quicker to review, which, based on my recent experience over in the DataFusion repo, is desirable for the core maintainers (and probably everyone else too!).
Which issue does this PR close?
N/A - This is a very early, in-progress implementation that I would like to use to get some early feedback from reviewers. Related: `GEOMETRY` and `GEOGRAPHY` support, #8717.

Are there any user-facing changes?
No
Additional Notes
This is (hopefully quite clearly) in a very in-progress state. I'm happy to take any feedback, but panics, lingering debug prints, TODOs etc. would all be cleaned up before I would consider any of this ready for a thorough review. I've highlighted the specific code I have questions about regarding the general approach, so reviewers don't have to trudge through all of what's here unless they're feeling very motivated. I also would not be surprised if the general organization of the code changes as the implementation matures, but if anyone has strong opinions on the organization I'm happy to take those into consideration as this develops.