Skip to content

Fix duplicate filtering path in Arrow task batches#3448

Open
GayathriSrividya wants to merge 6 commits into
apache:mainfrom
GayathriSrividya:fix/issue-3272-task-to-record-batches-segfault
Open

Fix duplicate filtering path in Arrow task batches#3448
GayathriSrividya wants to merge 6 commits into
apache:mainfrom
GayathriSrividya:fix/issue-3272-task-to-record-batches-segfault

Conversation

@GayathriSrividya

Copy link
Copy Markdown

Closes #3272

What this changes

This PR updates the Arrow scan path in _task_to_record_batches to avoid redundant filtering when there are no positional deletes.

  • Keeps predicate pushdown in Scanner.from_fragment as the only filter path when positional_deletes is absent.
  • Applies current_batch.filter(pyarrow_filter) only in the positional-delete path, after deletes are applied.
  • Preserves empty-batch handling after both delete application and conditional filtering.

Why

The previous flow could perform an extra table-level refilter even when the scanner already applied the predicate. This change removes that stale workaround path while keeping correct behavior for positional delete scenarios.

Tests

Added regression coverage in tests/io/test_pyarrow.py:

  • test_task_to_record_batches_filter_without_positional_deletes_avoids_table_refilter
  • test_task_to_record_batches_filter_with_positional_deletes_handles_empty_batch

Validated locally:

  • python -m pytest tests/io/test_pyarrow.py -q -k "task_to_record_batches_nanos or filter_without_positional_deletes_avoids_table_refilter or filter_with_positional_deletes_handles_empty_batch"
  • make lint

Comment thread tests/io/test_pyarrow.py Outdated
Replace the two weak tests that passed on the unfixed code with proper
regression coverage:

1. test_task_to_record_batches_does_not_use_table_filter_without_positional_deletes
   Replaces pyarrow.Table with a sentinel class whose from_batches raises
   AssertionError.  A custom metaclass preserves isinstance checks so the
   rest of the code-path is unaffected.  With the fix the sentinel is never
   reached; without the fix it fires immediately, detecting the old
   pa.Table.from_batches / filter / combine_chunks / to_batches[0] path
   that caused the SIGSEGV on Apple Silicon.

2. test_task_to_record_batches_filter_applied_after_positional_deletes
   Uses data [1,2,3,4,5] with positional deletes on positions 1 and 3
   (removing values 2 and 4), then applies filter id > 2.  Expected
   result [3, 5] would be wrong if either deletes or the subsequent
   filter were skipped.

@Fokko Fokko left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great catch @GayathriSrividya This logic must have been messed up somewhere throughout refactoring.

I left one comment to avoid having two blocks for positional_deletes, let me know what you think. This will reduce the number of lines quite a bit.

Comment thread pyiceberg/io/pyarrow.py
- Combine the two positional-delete handling blocks into one: apply
  take(indices) and the row filter together inside the single
  'if positional_deletes' block, removing the redundant mid-loop
  empty-batch check (filtering an empty batch is free).
- Replace the sentinel-based regression test (which passed even when
  the fix was reverted) with a behavioral test that picks a scenario
  where the old bug produces a distinct wrong answer:
    data=[1,2,3,4], pos_delete=2 (value 3), filter=id>2
    correct: [4]   old bug (scanner pre-filters): [3,4]
  The assertion on [4] will fail against any regression.

Addresses review feedback from @ebyhr and @Fokko.

@Fokko Fokko left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, thanks @GayathriSrividya 🙌

@kevinjqliu kevinjqliu added this to the PyIceberg 0.12.0 milestone Jun 4, 2026
Comment thread pyiceberg/io/pyarrow.py Outdated
PyArrow < 21 raises IndexError when RecordBatch.filter(Expression)
produces zero rows. Wrap the call in try/except and fall back to an
empty slice, which is already handled by the num_rows == 0 guard below.

Add regression test covering the positional-delete path where the
post-delete filter eliminates all remaining rows.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread pyiceberg/io/pyarrow.py
GayathriSrividya and others added 2 commits June 8, 2026 13:45
Co-authored-by: Fokko Driesprong <fokko@apache.org>
…ndexError

RecordBatch.filter(Expression) raises IndexError on PyArrow 17-20 when
the result has zero rows (fixed upstream in apache/arrow#46057). Replace
the try/except approach with the Table-based workaround that avoids the
bug, consistent with the pattern already used for the non-positional-
deletes path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

_task_to_record_batches stale PyArrow workaround causes SIGSEGV on Apple Silicon

5 participants