Skip to content

feat(rust/sedona-schema): support WKT1/WKT2 CRS strings in deserialize_crs#953

Draft
james-willis wants to merge 4 commits into
apache:mainfrom
james-willis:jw/crs-wkt-support
Draft

feat(rust/sedona-schema): support WKT1/WKT2 CRS strings in deserialize_crs#953
james-willis wants to merge 4 commits into
apache:mainfrom
james-willis:jw/crs-wkt-support

Conversation

@james-willis

@james-willis james-willis commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

This PR aims to support WKT1/2 through all uses of the crs.rs functionality. This includes ST_Transform, ST_CRS, and ST_SetCRS. A couple things worth pointing out that are somewhat opinionated or potentially controversial:

  • Authority-less CRSes (e.g. custom LCC): transforms work (verbatim WKT → PROJ), but there's no SRID and equality degrades to exact-string match — conservative (can false-negative across different spellings, never false-positive); semantic equivalence would need PROJ at the sedona-proj layer.
  • Canonicalization: a WKT carrying an embedded authority (AUTHORITY/ID) canonicalizes to that code (EPSG:3857) when it crosses a field-metadata boundary — desirable, since PROJ treats the code and WKT identically; authority-less WKT is kept verbatim.
  • WKT detection vs. opaque fallback: WKT is currently gated behind a keyword check (looks_like_wkt), keeping the else branch as PROJJSON so deserialize_crs still errors on unrecognized input. The alternative — drop the keyword list and treat any non-PROJJSON string as an opaque, PROJ-consumable definition (also accepting proj4 strings and any WKT2 keyword the list misses) — is simpler but makes deserialize_crs lenient (defers validation to PROJ, PostGIS-style) and would rename Wkt. I lean toward the opaque fallback, but want a deliberate decision since deserialize_crs is core with many callers.
  • RS_CRS vs ST_CRS read behavior (preexisting inconsistency): ST_CRS abbreviates to the authority code on read, while RS_CRS returns the stored CRS string verbatim (rasters abbreviate at RS_SetCRS/normalize time instead). They agree when the CRS is set via the UDFs, but diverge for a raw WKT-with-authority stored directly. This predates the PR (it also affects e.g. a stored PROJJSON-with-id) — WKT just makes it more visible. Should RS_CRS be aligned to abbreviate on read, or left as-is?

I am happy to revise these behaviors based on what the group thinks! The equality one specifically I'm not happy with, but I didn't want to bring proj into sedona-schema.

…e_crs

Add a Wkt CRS type that carries the WKT verbatim (so PROJ can consume it
directly for transforms) and extracts a best-effort authority:code from the
trailing AUTHORITY[...] / ID[...] tag for equality, SRID, and geographic-
parameter lookups. WKT is detected by leading keyword and routed ahead of the
PROJJSON fallback in deserialize_crs.

Adds consumer coverage (ST_Transform WKT1/WKT2, ST_SRID, ST_SetCRS, ST_CRS)
and st_setcrs/st_crs doc callouts for the authority-code canonicalization.
@github-actions github-actions Bot requested a review from prantogg June 12, 2026 22:13
ST_CRS fell back to to_json() for a CRS with no authority code, which JSON-
escapes the value (e.g. a WKT came back wrapped in quotes and could not be fed
back into ST_SetCRS). Use to_crs_string() instead — the unescaped form intended
as input to PROJ/GDAL — so an authority-less WKT round-trips verbatim.
…RS/RS_SetCRS

Mirror the ST_* WKT tests on the raster side: RS_SRID resolves a WKT with an
EPSG authority to its SRID and errors on an authority-less WKT; RS_CRS echoes a
stored WKT verbatim; RS_SetCRS canonicalizes an EPSG-bearing WKT to its
authority code (via normalize) and stores an authority-less WKT verbatim.

(No RS_Transform UDF exists, so there's nothing to cover there.)
Some("EPSG:3857")
);
assert_eq!(crs.srid().unwrap(), Some(3857));
assert_eq!(crs.to_crs_string(), "EPSG:3857");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be true...to_crs_string() should give the original string back verbatim.

Comment on lines +180 to +187
// Prefer the authority code; otherwise the unescaped CRS string
// (`to_crs_string`, not `to_json`) so the result round-trips
// back through ST_SetCRS / deserialize_crs — e.g. a raw WKT
// rather than a JSON-quoted one.
Some(
crs.to_authority_code()?
.unwrap_or_else(|| crs.to_crs_string()),
)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be canonicalizing on the way in...this looses information.

Comment on lines +482 to +486
// WKT with an EPSG authority -> ST_CRS returns the authority code.
let crs = deserialize_crs(WKT_3857).unwrap();
let tester = ScalarUdfTester::new(udf.clone(), vec![SedonaType::Wkb(Edges::Planar, crs)]);
let result = tester.invoke_scalar("POINT (0 1)").unwrap();
tester.assert_scalar_result_equals(result, "EPSG:3857");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely consistent with previous behaviour, although I don't actually like the previous behaviour (I think it's confusing that the CRS you'd get by inspecting the type is different than the CRS you'd get from calling ST_Crs().

Comment on lines +516 to +524
// A WKT carrying an EPSG authority canonicalizes to that code via normalize().
let rasters = generate_test_rasters(1, None).unwrap();
let result = tester
.invoke_array_scalar(Arc::new(rasters), WKT_3857)
.unwrap();
let result_struct = result.as_any().downcast_ref::<StructArray>().unwrap();
let raster = RasterStructArray::new(result_struct).get(0).unwrap();
assert_eq!(raster.crs(), Some("EPSG:3857"));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For rasters I'm not sure...we always store rasters as an item-level crs where the compactness of authority:code matters (although with the string view representation and the optimization that was just added to this file, there shouldn't be a memory or performance issue). On the other hand, collapsing to authority:code on the way in looses information. Probably authority:code is fine until somebody notices.

Comment thread rust/sedona-schema/src/crs.rs Outdated
#[derive(Debug)]
struct Wkt {
wkt: String,
authority: Option<String>,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few places here that just say "authority" which should probably be named "authority_code" to minimize confusion (unless you are only caching the authority)

static WKT_AUTHORITY_RE: LazyLock<Regex> = LazyLock::new(|| {
Regex::new(r#"(?i)\b(?:AUTHORITY|ID)\s*[\[(]\s*"([^"]+)"\s*,\s*"?([^"\],)]+)"?"#)
.expect("valid WKT authority regex")
});

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit of a hack and I don't think it's too much extra work to do a bit better. WKT is very JSONish

enum WktNode<'a> {
   Unquoted(&'a str), // usually a number but you don't have to parse it; starts with `[0-9.-]`
   Quoted(&'a str), // a string, can be serialized with a JSON parser; starts with `"`
   Tagged(&'a str, Vec<Box<WktNode<'a>>>) // TAG[<nodes, comma separated>]; starts with anything else but must contain a `[]`.
}

Feel free to open a follow up ticket for that.

Comment thread rust/sedona-schema/src/crs.rs Outdated
Comment on lines +756 to +762
/// A CRS given as a WKT1/WKT2 string.
///
/// The WKT is preserved verbatim (`to_crs_string`) so it can be handed to PROJ
/// for transforms, while an `authority:code` is extracted once at construction
/// for equality, SRID, and geographic-parameter lookups.
#[derive(Debug)]
struct Wkt {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this WktCrs? (Wkt is already a thing we use...the concrete geo-traits class that WKT parses into)

Comment on lines +775 to +781
/// Emit the extracted `authority:code` when available (so round-tripped
/// metadata is compact and recognized), otherwise the WKT itself. Both are
/// re-parseable by [`deserialize_crs`].
fn to_json(&self) -> String {
let value = self.authority.as_deref().unwrap_or(&self.wkt);
Value::String(value.to_string()).to_string()
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you actually want Value::String(self.wkt).to_string() here. The Crs doesn't take sides about representations...if the consumer wants authority:code they can ask for it.

Comment on lines +812 to +817
fn geographic_params(&self) -> Result<Option<GeographicCrsParams>> {
Ok(self
.authority
.as_deref()
.and_then(geographic_params_for_authority))
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had some basic parsing we could extract these rather than restrict to the hard-coded authority/code combos

assert!(looks_like_wkt(" geogcs[\"x\"]")); // leading ws + lowercase
assert!(!looks_like_wkt("EPSG:4326"));
assert!(!looks_like_wkt(r#"{"type":"GeographicCRS"}"#));
assert!(!looks_like_wkt(r#"PROJECTILE["not a crs"]"#));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like this to be a CRS even if it isn't one 🙂

…hority field to authority_code

Addresses review: `Wkt` collided with the geo-traits WKT type, and the
`authority` field holds an `authority:code` string. No behavior change.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants