Skip to content

feat(rust/sedona-schema): support WKT1/WKT2 CRS strings in deserialize_crs#953

Draft
james-willis wants to merge 13 commits into
apache:mainfrom
james-willis:jw/crs-wkt-support
Draft

feat(rust/sedona-schema): support WKT1/WKT2 CRS strings in deserialize_crs#953
james-willis wants to merge 13 commits into
apache:mainfrom
james-willis:jw/crs-wkt-support

Conversation

@james-willis

@james-willis james-willis commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

This PR aims to support WKT1/2 through all uses of the crs.rs functionality. This includes ST_Transform, ST_CRS, and ST_SetCRS. A couple things worth pointing out that are somewhat opinionated or potentially controversial:

  • Authority-less CRSes (e.g. custom LCC): transforms work (verbatim WKT → PROJ), but there's no SRID and equality degrades to exact-string match — conservative (can false-negative across different spellings, never false-positive); semantic equivalence would need PROJ at the sedona-proj layer.
  • Canonicalization: a WKT carrying an embedded authority (AUTHORITY/ID) canonicalizes to that code (EPSG:3857) when it crosses a field-metadata boundary — desirable, since PROJ treats the code and WKT identically; authority-less WKT is kept verbatim.
  • WKT detection vs. opaque fallback: WKT is currently gated behind a keyword check (looks_like_wkt), keeping the else branch as PROJJSON so deserialize_crs still errors on unrecognized input. The alternative — drop the keyword list and treat any non-PROJJSON string as an opaque, PROJ-consumable definition (also accepting proj4 strings and any WKT2 keyword the list misses) — is simpler but makes deserialize_crs lenient (defers validation to PROJ, PostGIS-style) and would rename Wkt. I lean toward the opaque fallback, but want a deliberate decision since deserialize_crs is core with many callers.
  • RS_CRS vs ST_CRS read behavior (preexisting inconsistency): ST_CRS abbreviates to the authority code on read, while RS_CRS returns the stored CRS string verbatim (rasters abbreviate at RS_SetCRS/normalize time instead). They agree when the CRS is set via the UDFs, but diverge for a raw WKT-with-authority stored directly. This predates the PR (it also affects e.g. a stored PROJJSON-with-id) — WKT just makes it more visible. Should RS_CRS be aligned to abbreviate on read, or left as-is?

I am happy to revise these behaviors based on what the group thinks! The equality one specifically I'm not happy with, but I didn't want to bring proj into sedona-schema.

@github-actions github-actions Bot requested a review from prantogg June 12, 2026 22:13
Some("EPSG:3857")
);
assert_eq!(crs.srid().unwrap(), Some(3857));
assert_eq!(crs.to_crs_string(), "EPSG:3857");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be true...to_crs_string() should give the original string back verbatim.

Comment thread rust/sedona-functions/src/st_srid.rs Outdated
Comment on lines +180 to +187
// Prefer the authority code; otherwise the unescaped CRS string
// (`to_crs_string`, not `to_json`) so the result round-trips
// back through ST_SetCRS / deserialize_crs — e.g. a raw WKT
// rather than a JSON-quoted one.
Some(
crs.to_authority_code()?
.unwrap_or_else(|| crs.to_crs_string()),
)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be canonicalizing on the way in...this looses information.

Comment on lines +482 to +486
// WKT with an EPSG authority -> ST_CRS returns the authority code.
let crs = deserialize_crs(WKT_3857).unwrap();
let tester = ScalarUdfTester::new(udf.clone(), vec![SedonaType::Wkb(Edges::Planar, crs)]);
let result = tester.invoke_scalar("POINT (0 1)").unwrap();
tester.assert_scalar_result_equals(result, "EPSG:3857");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely consistent with previous behaviour, although I don't actually like the previous behaviour (I think it's confusing that the CRS you'd get by inspecting the type is different than the CRS you'd get from calling ST_Crs().

Comment on lines +516 to +524
// A WKT carrying an EPSG authority canonicalizes to that code via normalize().
let rasters = generate_test_rasters(1, None).unwrap();
let result = tester
.invoke_array_scalar(Arc::new(rasters), WKT_3857)
.unwrap();
let result_struct = result.as_any().downcast_ref::<StructArray>().unwrap();
let raster = RasterStructArray::new(result_struct).get(0).unwrap();
assert_eq!(raster.crs(), Some("EPSG:3857"));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For rasters I'm not sure...we always store rasters as an item-level crs where the compactness of authority:code matters (although with the string view representation and the optimization that was just added to this file, there shouldn't be a memory or performance issue). On the other hand, collapsing to authority:code on the way in looses information. Probably authority:code is fine until somebody notices.

Comment thread rust/sedona-schema/src/crs.rs Outdated
#[derive(Debug)]
struct Wkt {
wkt: String,
authority: Option<String>,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few places here that just say "authority" which should probably be named "authority_code" to minimize confusion (unless you are only caching the authority)

static WKT_AUTHORITY_RE: LazyLock<Regex> = LazyLock::new(|| {
Regex::new(r#"(?i)\b(?:AUTHORITY|ID)\s*[\[(]\s*"([^"]+)"\s*,\s*"?([^"\],)]+)"?"#)
.expect("valid WKT authority regex")
});

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit of a hack and I don't think it's too much extra work to do a bit better. WKT is very JSONish

enum WktNode<'a> {
   Unquoted(&'a str), // usually a number but you don't have to parse it; starts with `[0-9.-]`
   Quoted(&'a str), // a string, can be serialized with a JSON parser; starts with `"`
   Tagged(&'a str, Vec<Box<WktNode<'a>>>) // TAG[<nodes, comma separated>]; starts with anything else but must contain a `[]`.
}

Feel free to open a follow up ticket for that.

Comment thread rust/sedona-schema/src/crs.rs Outdated
Comment on lines +756 to +762
/// A CRS given as a WKT1/WKT2 string.
///
/// The WKT is preserved verbatim (`to_crs_string`) so it can be handed to PROJ
/// for transforms, while an `authority:code` is extracted once at construction
/// for equality, SRID, and geographic-parameter lookups.
#[derive(Debug)]
struct Wkt {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call this WktCrs? (Wkt is already a thing we use...the concrete geo-traits class that WKT parses into)

Comment on lines +775 to +781
/// Emit the extracted `authority:code` when available (so round-tripped
/// metadata is compact and recognized), otherwise the WKT itself. Both are
/// re-parseable by [`deserialize_crs`].
fn to_json(&self) -> String {
let value = self.authority.as_deref().unwrap_or(&self.wkt);
Value::String(value.to_string()).to_string()
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you actually want Value::String(self.wkt).to_string() here. The Crs doesn't take sides about representations...if the consumer wants authority:code they can ask for it.

Comment on lines +812 to +817
fn geographic_params(&self) -> Result<Option<GeographicCrsParams>> {
Ok(self
.authority
.as_deref()
.and_then(geographic_params_for_authority))
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had some basic parsing we could extract these rather than restrict to the hard-coded authority/code combos

assert!(looks_like_wkt(" geogcs[\"x\"]")); // leading ws + lowercase
assert!(!looks_like_wkt("EPSG:4326"));
assert!(!looks_like_wkt(r#"{"type":"GeographicCRS"}"#));
assert!(!looks_like_wkt(r#"PROJECTILE["not a crs"]"#));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like this to be a CRS even if it isn't one 🙂

…nstead of collapsing to authority:code

ST_Crs returned the authority code when a CRS carried one, discarding the
original PROJJSON/WKT definition. CachedCrsNormalization (shared by
ST_SetCRS and RS_SetCRS for item-level CRS) did the same on the way in.
Both now emit the round-trippable to_crs_string(): an authority:code CRS
stays compact, while a definition-based CRS is preserved verbatim.

The per-row CRS arrays are built with StringViewBuilder::with_deduplicate_strings(),
so a batch that repeats the same (possibly large) definition stores the
bytes once and shares them across rows.
…+ dedup/null parity)

Address review on apache#962:
- crs.rs: add an id-less PROJJSON case to normalize_preserves_definition so
  round-trip equality exercises ProjJSON::crs_equals' full-body comparison
  rather than the authority-code short circuit.
- st_setsrid/rs_setsrid: add multi-row tests for normalize_crs_array /
  normalize_crs_columnar covering null placement, full-definition
  preservation, and that with_deduplicate_strings() stores a repeated
  definition in the shared buffer exactly once.
- Fix a stale RS_SetCRS comment that still described authority:code collapse.
PROJJSON is stored as a parsed serde_json tree and re-serialized via
to_crs_string(), so object keys may be reordered (BTreeMap) and whitespace
normalized. The definition is preserved in full semantically and round-trips,
but is not byte-identical. WKT is byte-verbatim (the raw string is retained).
A raster batch is typically high-cardinality in records but low-cardinality
in CRS, so normalize_crs_columnar now handles the two cases distinctly:

- Scalar CRS: normalize once, return a single-element array; broadcast_string_view
  fans it out across rows via try_append_value_n, storing the definition once.
- Array CRS: normalize per row with a deduplicating builder, so the few distinct
  definitions in a large column are stored once in the shared view buffer.

Adds scalar-path and broadcast buffer-sharing tests.
…default method

Every impl derived the SRID the same way from to_authority_code() (lon/lat
alias -> 4326, EPSG:{code} -> code, else None). Hoist that into a default
trait method and drop the identical AuthorityCode/ProjJSON implementations.
Behavior is unchanged.
…lize only in metadata

deserialize_crs no longer folds EPSG:4326 to OGC:CRS84, so ST_Crs/RS_Crs and
SetCRS round-trip the user's authority code unchanged. The lon/lat equivalence
moves from a deserialize-time fold to the single source of truth in crs_equals
(via authority_codes_equal), and PartialEq now just delegates to it. To keep
GeoParquet/GeoArrow axis-order-explicit, AuthorityCode::to_json canonicalizes
the lnglat alias to OGC:CRS84 while to_crs_string preserves the input.

Removes the now-unused is_str_lnglat/is_lnglat. SRID 4326 -> OGC:CRS84 (the
numeric SRID path) is unchanged.
The fixed-CRS-WKB -> item_crs widening stored to_authority_code(), collapsing a
full PROJJSON/WKT definition to its embedded code; switch to to_crs_string() so
the item-level CRS column keeps the full definition (read back via
deserialize_crs), consistent with normalize_crs_array/RS_SetCRS. Also add a
GeoArrow-metadata serialization test covering the EPSG:4326 -> OGC:CRS84
canonicalization through serialize_edges_and_crs.
…e_crs

Add a Wkt CRS type that carries the WKT verbatim (so PROJ can consume it
directly for transforms) and extracts a best-effort authority:code from the
trailing AUTHORITY[...] / ID[...] tag for equality, SRID, and geographic-
parameter lookups. WKT is detected by leading keyword and routed ahead of the
PROJJSON fallback in deserialize_crs.

Adds consumer coverage (ST_Transform WKT1/WKT2, ST_SRID, ST_SetCRS, ST_CRS)
and st_setcrs/st_crs doc callouts for the authority-code canonicalization.
ST_CRS fell back to to_json() for a CRS with no authority code, which JSON-
escapes the value (e.g. a WKT came back wrapped in quotes and could not be fed
back into ST_SetCRS). Use to_crs_string() instead — the unescaped form intended
as input to PROJ/GDAL — so an authority-less WKT round-trips verbatim.
…RS/RS_SetCRS

Mirror the ST_* WKT tests on the raster side: RS_SRID resolves a WKT with an
EPSG authority to its SRID and errors on an authority-less WKT; RS_CRS echoes a
stored WKT verbatim; RS_SetCRS canonicalizes an EPSG-bearing WKT to its
authority code (via normalize) and stores an authority-less WKT verbatim.

(No RS_Transform UDF exists, so there's nothing to cover there.)
…hority field to authority_code

Addresses review: `Wkt` collided with the geo-traits WKT type, and the
`authority` field holds an `authority:code` string. No behavior change.
Sits on top of the CRS cleanup: drop WktCrs::srid (the trait default derives it
from to_authority_code), and route WktCrs::crs_equals through the shared
authority_codes_equal helper so the lon/lat alias rule lives in one place.

Update the RS_SetCRS WKT test: a WKT carrying an EPSG authority is now preserved
verbatim (definitions are no longer collapsed to their embedded code) and still
compares equal to that authority code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants