Mask compression codec #306
Draft
yfukai wants to merge 37 commits into
The previous commit (db35287) mistakenly deleted upstream/main's _SQLIDSet scratch-table machinery, _create_id_scratch_table, the out_degree/copy bound-variable handling, and the three scratch-table tests — they were diffed against a stale fork main and wrongly treated as PR-added code. Restore them verbatim from main; the struct-attr column-expansion simplification from the previous commit is kept.
Masks were registered as pl.Object and round-tripped through pickle,
leaving the bbox locked inside an opaque blob. They are now stored as
pl.Struct({min_(z)yx, max_(z)yx: Int64, data: Binary}) so bbox fields
are natively filterable via NodeAttr("mask").struct.field(...), while
the binary mask stays blosc2-compressed in the data field.
- Mask.struct_dtype() / to_struct() / from_struct() conversion API,
plus as_mask() to coerce struct dicts and legacy Mask objects alike
- RegionPropsNodes and MaskDiskAttrs register the struct dtype and
write struct values
- consumers (GraphArrayView, IoUEdgeAttr, MaskMatching, ctc metrics,
compute_overlaps, to_geff) materialize masks via as_mask(), so
legacy pl.Object mask attributes keep working
- ctc metrics skip the pickle-to-bytes multiprocessing shim for
struct mask columns, which are Arrow-native
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
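To illustrate why lifting the bbox out of the opaque blob matters, here is a minimal stdlib-only sketch. It is not the PR's code: plain dicts stand in for `pl.Struct` rows, `zlib` stands in for blosc2, and the 2D field names (`min_y`, `min_x`, `max_y`, `max_x`), the exclusive-max convention, and both helper functions are assumptions made for illustration.

```python
import zlib


def make_mask_row(min_y, min_x, max_y, max_x, pixels: bytes) -> dict:
    """Build a struct-shaped mask value: plain bbox ints plus an opaque blob."""
    return {
        "min_y": min_y,
        "min_x": min_x,
        "max_y": max_y,
        "max_x": max_x,
        "data": zlib.compress(pixels),  # stand-in for the blosc2 payload
    }


def bbox_intersects(mask: dict, min_y, min_x, max_y, max_x) -> bool:
    """Half-open box test that reads only the bbox fields, never `data`."""
    return (
        mask["min_y"] < max_y
        and min_y < mask["max_y"]
        and mask["min_x"] < max_x
        and min_x < mask["max_x"]
    )


rows = [
    make_mask_row(0, 0, 4, 4, b"\x01" * 16),
    make_mask_row(10, 10, 14, 14, b"\x01" * 16),
    make_mask_row(2, 2, 6, 6, b"\x01" * 16),
]

# Select masks whose bbox touches the query window without decoding any blob.
hits = [i for i, m in enumerate(rows) if bbox_intersects(m, 3, 3, 8, 8)]
```

With a pickled `pl.Object` column, the same query would have to unpickle every mask before its bbox could be inspected.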
Resolve conflicts from upstream's struct-attribute PR (royerlab#268) and the
single->bulk add_node/add_edge refactor:
- attrs.py: take upstream's `columns` property and `f.to_attr().expr`
filter reduction (handles compound AttrFilters as well as struct-field
AttrComparisons).
- _rustworkx_graph.py: drop the now-duplicate nested `_extract_field_path`;
use upstream's module-level `_eval_filter` (supports AND/OR/XOR/NOT).
- _sql_graph.py: adopt upstream's `_to_sql_clause` and the removal of the
single `add_node`/`add_edge` methods (BaseGraph now delegates to the bulk
variants). Mask struct flattening is preserved via
`_flatten_attrs_for_write` on the bulk write paths.
Mask remains stored as a struct attribute (this branch's feature); upstream
left Mask as pl.Object.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The Mask struct's `data` leaf (blosc2-compressed bytes) was stored through a
SQLAlchemy PickleType column, wrapping the already-compressed bytes in a
second pickle layer on every write and unpickling on every read. Map
`pl.Binary` to `sa.LargeBinary` so binary bytes are stored as a raw BLOB.
Pickling is now detected from the actual SQL column type (PickleType only,
not LargeBinary):
- `_is_pickled_sql_type` returns True only for PickleType columns.
- `unpickle_bytes_columns` takes the explicit set of pickled physical
columns so raw-binary columns (e.g. the Mask `data` leaf) are left
untouched.
- `_restore_pickled_column_types` re-tags reflected LargeBinary columns as
PickleType except genuine raw-binary columns (schema dtype pl.Binary),
which reflection cannot otherwise distinguish from pickled blobs.
Reordered `_define_schema` so the attribute schemas are available when
this runs.
Adds a regression test asserting the Mask `data` leaf is a raw LargeBinary
column (not PickleType) before and after reload, and that the mask
round-trips and struct-field filtering still works.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
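The cost of the extra pickle layer can be illustrated with the standard library alone. This is a sketch rather than the PR's code: `zlib` stands in for blosc2, and a plain `pickle.dumps` call approximates what a pickling column type does on write.

```python
import pickle
import zlib

# A synthetic, already-compressed payload (zlib is a stand-in for blosc2).
raw_mask = bytes(i % 2 for i in range(1024))
compressed = zlib.compress(raw_mask)

# Pickled storage: the compressed bytes are wrapped in a pickle frame on
# write and must be unpickled again on every read.
pickled = pickle.dumps(compressed)
overhead = len(pickled) - len(compressed)
restored_via_pickle = pickle.loads(pickled)

# Raw BLOB storage: the stored bytes are the payload itself, so the only
# decode step left is the decompression.
restored_via_blob = zlib.decompress(compressed)
```

The per-value overhead is small in bytes, but the pickle round trip runs for every mask, and reading a pickled column requires trusting its contents in a way a raw byte column does not.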
Introduce `MaskCodec` (BLOSC2, RAW, PACKBITS) so callers can trade encode
cost against stored size. The codec is recorded as the first byte of the
packed payload, making each mask self-describing: a single column can mix
codecs and any encoding stays readable regardless of the current default.
- `set_default_mask_codec` / `get_default_mask_codec` expose the default
codec used when none is passed; `Mask.to_struct(codec=...)` overrides per
call.
- RAW and PACKBITS prepend a tiny shape header (ndim + uint32 dims) so
unpacking needs no external metadata; BLOSC2 keeps using its
self-describing cframe.
- RAW decodes as a zero-copy read-only view; `Mask.__isub__` now
copies-on-write so in-place difference still works on such masks.
Default remains BLOSC2, so existing behavior is unchanged. PACKBITS is
typically far smaller for small/medium cell masks (e.g. 26 B vs 347 B for an
11x11 disk).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
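The self-describing payload described above could look roughly like the following stdlib-only sketch. The codec byte values, the one-byte `ndim`, little-endian dims, MSB-first bit order, and the names `Codec`, `pack_mask`, and `unpack_mask` are all assumptions for illustration; the PR's actual layout is not shown here, and BLOSC2 is left out.

```python
import struct
from enum import IntEnum
from math import prod


class Codec(IntEnum):
    """Hypothetical codec ids; the PR's real byte values are not shown."""
    RAW = 1
    PACKBITS = 2


def pack_mask(flat, shape, codec):
    """Encode C-ordered 0/1 values as codec byte + shape header + body."""
    if len(flat) != prod(shape):
        raise ValueError("flat length does not match shape")
    # 1 byte codec id, 1 byte ndim, then one little-endian uint32 per dim.
    header = struct.pack(f"<BB{len(shape)}I", codec, len(shape), *shape)
    if codec == Codec.RAW:
        body = bytes(1 if v else 0 for v in flat)  # one byte per pixel
    elif codec == Codec.PACKBITS:
        packed = bytearray((len(flat) + 7) // 8)  # one bit per pixel, MSB first
        for i, v in enumerate(flat):
            if v:
                packed[i >> 3] |= 0x80 >> (i & 7)
        body = bytes(packed)
    else:
        raise ValueError(f"unknown codec: {codec!r}")
    return header + body


def unpack_mask(payload):
    """Decode from the payload alone; returns (flat list of bools, shape)."""
    codec = Codec(payload[0])  # first byte selects the decoder
    ndim = payload[1]
    shape = struct.unpack_from(f"<{ndim}I", payload, 2)
    body = payload[2 + 4 * ndim:]
    n = prod(shape)
    if codec == Codec.RAW:
        flat = [bool(b) for b in body[:n]]
    else:
        flat = [bool(body[i >> 3] & (0x80 >> (i & 7))) for i in range(n)]
    return flat, shape


# An 11x11 disk of radius 5, flattened in row-major order.
disk = [(y - 5) ** 2 + (x - 5) ** 2 <= 25 for y in range(11) for x in range(11)]
packed = pack_mask(disk, (11, 11), Codec.PACKBITS)
raw = pack_mask(disk, (11, 11), Codec.RAW)
```

Under this assumed layout the 11x11 disk packs to 26 bytes (2 bytes for codec and ndim, 8 bytes of dims, 16 bit-packed bytes), which agrees with the 26 B figure quoted in the commit message. Because `unpack_mask` dispatches on the first byte, `packed` and `raw` could sit in the same column and both decode to the same mask.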
Human summary
Allows the user to change the compression codec for masks.