Skip to content

Mask compression codec#306

Draft
yfukai wants to merge 37 commits into
royerlab:mainfrom
yfukai:mask_compression_codec
Draft

Mask compression codec#306
yfukai wants to merge 37 commits into
royerlab:mainfrom
yfukai:mask_compression_codec

Conversation

@yfukai

@yfukai yfukai commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Human summary

Allows the user to change the compression codec for masks.

yfukai and others added 30 commits February 17, 2026 13:26
The previous commit (db35287) mistakenly deleted upstream/main's _SQLIDSet
scratch-table machinery, _create_id_scratch_table, the out_degree/copy
bound-variable handling, and the three scratch-table tests — they were diffed
against a stale fork main and wrongly treated as PR-added code. Restore them
verbatim from main; the struct-attr column-expansion simplification from the
previous commit is kept.
yfukai and others added 7 commits June 8, 2026 01:41
Masks were registered as pl.Object and round-tripped through pickle,
leaving the bbox locked inside an opaque blob. They are now stored as
pl.Struct({min_(z)yx, max_(z)yx: Int64, data: Binary}) so bbox fields
are natively filterable via NodeAttr("mask").struct.field(...), while
the binary mask stays blosc2-compressed in the data field.

- Mask.struct_dtype() / to_struct() / from_struct() conversion API,
  plus as_mask() to coerce struct dicts and legacy Mask objects alike
- RegionPropsNodes and MaskDiskAttrs register the struct dtype and
  write struct values
- consumers (GraphArrayView, IoUEdgeAttr, MaskMatching, ctc metrics,
  compute_overlaps, to_geff) materialize masks via as_mask(), so
  legacy pl.Object mask attributes keep working
- ctc metrics skip the pickle-to-bytes multiprocessing shim for
  struct mask columns, which are Arrow-native

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Resolve conflicts from upstream's struct-attribute PR (royerlab#268) and the
single->bulk add_node/add_edge refactor:

- attrs.py: take upstream's `columns` property and `f.to_attr().expr`
  filter reduction (handles compound AttrFilters as well as struct-field
  AttrComparisons).
- _rustworkx_graph.py: drop the now-duplicate nested `_extract_field_path`;
  use upstream's module-level `_eval_filter` (supports AND/OR/XOR/NOT).
- _sql_graph.py: adopt upstream's `_to_sql_clause` and the removal of the
  single `add_node`/`add_edge` methods (BaseGraph now delegates to the bulk
  variants). Mask struct flattening is preserved via `_flatten_attrs_for_write`
  on the bulk write paths.

Mask remains stored as a struct attribute (this branch's feature); upstream
left Mask as pl.Object.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The Mask struct's `data` leaf (blosc2-compressed bytes) was stored through a
SQLAlchemy PickleType column, wrapping the already-compressed bytes in a second
pickle layer on every write and unpickling on every read.

Map `pl.Binary` to `sa.LargeBinary` so binary bytes are stored as a raw BLOB.
Pickling is now detected from the actual SQL column type (PickleType only, not
LargeBinary):

- `_is_pickled_sql_type` returns True only for PickleType columns.
- `unpickle_bytes_columns` takes the explicit set of pickled physical columns
  so raw-binary columns (e.g. the Mask `data` leaf) are left untouched.
- `_restore_pickled_column_types` re-tags reflected LargeBinary columns as
  PickleType except genuine raw-binary columns (schema dtype pl.Binary), which
  reflection cannot otherwise distinguish from pickled blobs. Reordered
  `_define_schema` so the attribute schemas are available when this runs.

Adds a regression test asserting the Mask `data` leaf is a raw LargeBinary
column (not PickleType) before and after reload, and that the mask round-trips
and struct-field filtering still works.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Introduce `MaskCodec` (BLOSC2, RAW, PACKBITS) so callers can trade encode cost
against stored size. The codec is recorded as the first byte of the packed
payload, making each mask self-describing: a single column can mix codecs and
any encoding stays readable regardless of the current default.

- `set_default_mask_codec` / `get_default_mask_codec` expose the default codec
  used when none is passed; `Mask.to_struct(codec=...)` overrides per call.
- RAW and PACKBITS prepend a tiny shape header (ndim + uint32 dims) so unpacking
  needs no external metadata; BLOSC2 keeps using its self-describing cframe.
- RAW decodes as a zero-copy read-only view; `Mask.__isub__` now copies-on-write
  so in-place difference still works on such masks.

Default remains BLOSC2, so existing behavior is unchanged. PACKBITS is typically
far smaller for small/medium cell masks (e.g. 26 B vs 347 B for an 11x11 disk).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants