feat(rust/sedona-schema): support WKT1/WKT2 CRS strings in deserialize_crs#953
feat(rust/sedona-schema): support WKT1/WKT2 CRS strings in deserialize_crs#953james-willis wants to merge 4 commits into
Conversation
…e_crs Add a Wkt CRS type that carries the WKT verbatim (so PROJ can consume it directly for transforms) and extracts a best-effort authority:code from the trailing AUTHORITY[...] / ID[...] tag for equality, SRID, and geographic- parameter lookups. WKT is detected by leading keyword and routed ahead of the PROJJSON fallback in deserialize_crs. Adds consumer coverage (ST_Transform WKT1/WKT2, ST_SRID, ST_SetCRS, ST_CRS) and st_setcrs/st_crs doc callouts for the authority-code canonicalization.
ST_CRS fell back to to_json() for a CRS with no authority code, which JSON- escapes the value (e.g. a WKT came back wrapped in quotes and could not be fed back into ST_SetCRS). Use to_crs_string() instead — the unescaped form intended as input to PROJ/GDAL — so an authority-less WKT round-trips verbatim.
831872b to
289de75
Compare
…RS/RS_SetCRS Mirror the ST_* WKT tests on the raster side: RS_SRID resolves a WKT with an EPSG authority to its SRID and errors on an authority-less WKT; RS_CRS echoes a stored WKT verbatim; RS_SetCRS canonicalizes an EPSG-bearing WKT to its authority code (via normalize) and stores an authority-less WKT verbatim. (No RS_Transform UDF exists, so there's nothing to cover there.)
| Some("EPSG:3857") | ||
| ); | ||
| assert_eq!(crs.srid().unwrap(), Some(3857)); | ||
| assert_eq!(crs.to_crs_string(), "EPSG:3857"); |
There was a problem hiding this comment.
I don't think this should be true...to_crs_string() should give the original string back verbatim.
| // Prefer the authority code; otherwise the unescaped CRS string | ||
| // (`to_crs_string`, not `to_json`) so the result round-trips | ||
| // back through ST_SetCRS / deserialize_crs — e.g. a raw WKT | ||
| // rather than a JSON-quoted one. | ||
| Some( | ||
| crs.to_authority_code()? | ||
| .unwrap_or_else(|| crs.to_crs_string()), | ||
| ) |
There was a problem hiding this comment.
I don't think we should be canonicalizing on the way in...this looses information.
| // WKT with an EPSG authority -> ST_CRS returns the authority code. | ||
| let crs = deserialize_crs(WKT_3857).unwrap(); | ||
| let tester = ScalarUdfTester::new(udf.clone(), vec![SedonaType::Wkb(Edges::Planar, crs)]); | ||
| let result = tester.invoke_scalar("POINT (0 1)").unwrap(); | ||
| tester.assert_scalar_result_equals(result, "EPSG:3857"); |
There was a problem hiding this comment.
This is definitely consistent with previous behaviour, although I don't actually like the previous behaviour (I think it's confusing that the CRS you'd get by inspecting the type is different than the CRS you'd get from calling ST_Crs().
| // A WKT carrying an EPSG authority canonicalizes to that code via normalize(). | ||
| let rasters = generate_test_rasters(1, None).unwrap(); | ||
| let result = tester | ||
| .invoke_array_scalar(Arc::new(rasters), WKT_3857) | ||
| .unwrap(); | ||
| let result_struct = result.as_any().downcast_ref::<StructArray>().unwrap(); | ||
| let raster = RasterStructArray::new(result_struct).get(0).unwrap(); | ||
| assert_eq!(raster.crs(), Some("EPSG:3857")); | ||
|
|
There was a problem hiding this comment.
For rasters I'm not sure...we always store rasters as an item-level crs where the compactness of authority:code matters (although with the string view representation and the optimization that was just added to this file, there shouldn't be a memory or performance issue). On the other hand, collapsing to authority:code on the way in looses information. Probably authority:code is fine until somebody notices.
| #[derive(Debug)] | ||
| struct Wkt { | ||
| wkt: String, | ||
| authority: Option<String>, |
There was a problem hiding this comment.
There are a few places here that just say "authority" which should probably be named "authority_code" to minimize confusion (unless you are only caching the authority)
| static WKT_AUTHORITY_RE: LazyLock<Regex> = LazyLock::new(|| { | ||
| Regex::new(r#"(?i)\b(?:AUTHORITY|ID)\s*[\[(]\s*"([^"]+)"\s*,\s*"?([^"\],)]+)"?"#) | ||
| .expect("valid WKT authority regex") | ||
| }); |
There was a problem hiding this comment.
This is a bit of a hack and I don't think it's too much extra work to do a bit better. WKT is very JSONish
enum WktNode<'a> {
Unquoted(&'a str), // usually a number but you don't have to parse it; starts with `[0-9.-]`
Quoted(&'a str), // a string, can be serialized with a JSON parser; starts with `"`
Tagged(&'a str, Vec<Box<WktNode<'a>>>) // TAG[<nodes, comma separated>]; starts with anything else but must contain a `[]`.
}Feel free to open a follow up ticket for that.
| /// A CRS given as a WKT1/WKT2 string. | ||
| /// | ||
| /// The WKT is preserved verbatim (`to_crs_string`) so it can be handed to PROJ | ||
| /// for transforms, while an `authority:code` is extracted once at construction | ||
| /// for equality, SRID, and geographic-parameter lookups. | ||
| #[derive(Debug)] | ||
| struct Wkt { |
There was a problem hiding this comment.
Can we call this WktCrs? (Wkt is already a thing we use...the concrete geo-traits class that WKT parses into)
| /// Emit the extracted `authority:code` when available (so round-tripped | ||
| /// metadata is compact and recognized), otherwise the WKT itself. Both are | ||
| /// re-parseable by [`deserialize_crs`]. | ||
| fn to_json(&self) -> String { | ||
| let value = self.authority.as_deref().unwrap_or(&self.wkt); | ||
| Value::String(value.to_string()).to_string() | ||
| } |
There was a problem hiding this comment.
I think you actually want Value::String(self.wkt).to_string() here. The Crs doesn't take sides about representations...if the consumer wants authority:code they can ask for it.
| fn geographic_params(&self) -> Result<Option<GeographicCrsParams>> { | ||
| Ok(self | ||
| .authority | ||
| .as_deref() | ||
| .and_then(geographic_params_for_authority)) | ||
| } |
There was a problem hiding this comment.
If we had some basic parsing we could extract these rather than restrict to the hard-coded authority/code combos
| assert!(looks_like_wkt(" geogcs[\"x\"]")); // leading ws + lowercase | ||
| assert!(!looks_like_wkt("EPSG:4326")); | ||
| assert!(!looks_like_wkt(r#"{"type":"GeographicCRS"}"#)); | ||
| assert!(!looks_like_wkt(r#"PROJECTILE["not a crs"]"#)); |
There was a problem hiding this comment.
I would like this to be a CRS even if it isn't one 🙂
…hority field to authority_code Addresses review: `Wkt` collided with the geo-traits WKT type, and the `authority` field holds an `authority:code` string. No behavior change.
This PR aims to support WKT1/2 through all uses of the crs.rs functionality. This includes ST_Transform, ST_CRS, and ST_SetCRS. A couple things worth pointing out that are somewhat opinionated or potentially controversial:
looks_like_wkt), keeping the else branch as PROJJSON sodeserialize_crsstill errors on unrecognized input. The alternative — drop the keyword list and treat any non-PROJJSON string as an opaque, PROJ-consumable definition (also accepting proj4 strings and any WKT2 keyword the list misses) — is simpler but makesdeserialize_crslenient (defers validation to PROJ, PostGIS-style) and would renameWkt. I lean toward the opaque fallback, but want a deliberate decision sincedeserialize_crsis core with many callers.ST_CRSabbreviates to the authority code on read, whileRS_CRSreturns the stored CRS string verbatim (rasters abbreviate atRS_SetCRS/normalizetime instead). They agree when the CRS is set via the UDFs, but diverge for a raw WKT-with-authority stored directly. This predates the PR (it also affects e.g. a stored PROJJSON-with-id) — WKT just makes it more visible. ShouldRS_CRSbe aligned to abbreviate on read, or left as-is?I am happy to revise these behaviors based on what the group thinks! The equality one specifically I'm not happy with, but I didn't want to bring proj into sedona-schema.