Add build-retry, portability, and precomputed image-tag map#103

Open
jhmblundin wants to merge 1 commit into
scaleapi:mainfrom
blitzy-showcase:upstream-contrib/build-retry-and-portability

@jhmblundin jhmblundin commented Jun 3, 2026


This change improves the resilience and ergonomics of swe_bench_pro_eval.py without altering the existing eval_results.json schema.

What this changes

  • Retry sandbox creation on transient Modal image-build / registry pull errors (configurable via MAX_BUILD_RETRIES, BUILD_RETRY_DELAY). Prior behavior: a single flaky pull would fail the instance permanently and print "RemoteError" for that instance with no recovery.
  • Resolve dockerfile paths against the script's own directory via _REPO_ROOT, so the script can be invoked from any working directory instead of only the repo root.
  • Add helper_code/instance_to_tag_mapping.json, a precomputed instance_id -> Docker Hub tag mapping (731 entries) used in preference to the heuristic tag generator. Avoids edge-case mismatches such as element-hq__element-web's -vnan suffix handling. Falls back to the existing helper_code.image_uri heuristic if the mapping file is absent, so callers running without it are unaffected.
  • Add setup_dockerfile_commands to modal.Image.from_registry that best-effort install pip + requests on base images that lack them so parser.py can import third-party deps. Silently no-ops on images where pip is already present.
  • Pretty-print eval_results.json with indent=2 for human readability.
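As a rough illustration of the _REPO_ROOT bullet above, the idea can be sketched as follows. The helper names `repo_root_for` and `resolve_under` are hypothetical and not taken from the PR; only the approach (anchor paths to the script's own directory rather than the current working directory) comes from the description.

```python
import os


def repo_root_for(script_path: str) -> str:
    """Return the absolute directory that contains the given script.

    In the real script this would presumably be called with __file__;
    the path is a parameter here so the sketch runs anywhere.
    """
    return os.path.dirname(os.path.abspath(script_path))


def resolve_under(root: str, *parts: str) -> str:
    """Join path components onto the root instead of the working directory."""
    return os.path.join(root, *parts)


# Hypothetical usage inside the script:
#   _REPO_ROOT = repo_root_for(__file__)
#   tag_map_path = resolve_under(_REPO_ROOT, "helper_code", "instance_to_tag_mapping.json")
```

Because the root is derived from the script's location, changing the working directory after the root is computed does not change any resolved path.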

Backwards compatibility

  • eval_results.json schema is unchanged (still {instance_id: bool}).
  • Existing call sites of helper_code.image_uri.get_dockerhub_image_uri are still supported via the fallback path.
  • No new required dependencies.
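The schema claim can be checked with a small sketch: `indent=2` only changes whitespace, so the parsed content is identical. The instance ids below are made up for illustration, and `dump_results` is a hypothetical helper name.

```python
import json


def dump_results(results: dict) -> str:
    """Serialize eval results the way the PR describes: pretty-printed with indent=2."""
    return json.dumps(results, indent=2)


# Hypothetical results in the {instance_id: bool} shape named above.
example = {"instance-a": True, "instance-b": False}
pretty = dump_results(example)
compact = json.dumps(example)

# Both serializations parse back to the same mapping.
assert json.loads(pretty) == json.loads(compact) == example
```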

Greptile Summary

This PR improves resilience and ergonomics of swe_bench_pro_eval.py by adding a configurable build-retry loop around Modal sandbox creation, resolving Dockerfile paths via _REPO_ROOT so the script works from any directory, and introducing a precomputed instance_to_tag_mapping.json that takes precedence over the existing heuristic URI generator.

  • Retry logic: wraps modal.Image.from_registry + modal.Sandbox.create in a for loop (up to MAX_BUILD_RETRIES = 3) and uses is_image_build_error to distinguish transient failures from hard failures; also adds setup_dockerfile_commands to best-effort install pip + requests on minimal base images.
  • Tag mapping: _load_instance_tag_map lazily loads 731 precomputed instance_id → Docker Hub tag entries with a fallback to the heuristic, and eval_results.json is now written with indent=2 for readability.
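The retry behaviour summarized above can be sketched without any Modal calls. In this sketch `create` stands in for the image-build plus sandbox-creation step and `is_transient` for is_image_build_error; the function name `create_with_retry`, the injectable `sleep` parameter, and the 5-second delay are assumptions for illustration (the PR text only quotes MAX_BUILD_RETRIES = 3).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

MAX_BUILD_RETRIES = 3     # value quoted in the summary above
BUILD_RETRY_DELAY = 5.0   # seconds; assumed, the PR text does not state the default


def create_with_retry(
    create: Callable[[], T],
    is_transient: Callable[[BaseException], bool],
    max_retries: int = MAX_BUILD_RETRIES,
    delay: float = BUILD_RETRY_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call create() up to max_retries times, retrying only transient errors.

    Returns the created object on success, or None when the error is not
    transient or the attempts are exhausted.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return create()
        except Exception as exc:
            if not is_transient(exc):
                print(f"Non-transient failure, not retrying: {exc!r}")
                return None
            if attempt == max_retries:
                print(f"Giving up after {attempt} attempts: {exc!r}")
                return None
            print(f"Attempt {attempt}/{max_retries} failed, retrying in {delay}s: {exc!r}")
            sleep(delay)
    return None
```

Logging the exception on both give-up paths keeps the root cause visible even though the caller only sees None.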

Confidence Score: 4/5

Safe to merge — the retry and tag-mapping paths degrade gracefully, and the path-portability fix is strictly additive.

The retry loop correctly handles all exit paths (break on success, return None on exhausted/non-transient failures), and the tag-map fallback to the heuristic means a missing JSON file is benign. Two things are worth watching: is_image_build_error uses broad indicators like 'registry' and 'failed with the exception' that could silently absorb non-transient errors (auth failures, quota errors, config mistakes), and a malformed instance_to_tag_mapping.json will cause _load_instance_tag_map to re-raise JSONDecodeError on every instance call rather than caching a safe empty-dict fallback.

Pay closest attention to swe_bench_pro_eval.py, specifically the is_image_build_error indicator list and the _load_instance_tag_map error path.

Important Files Changed

| Filename | Overview |
| --- | --- |
| swe_bench_pro_eval.py | Adds build-retry loop, _REPO_ROOT portability fix, precomputed tag-map lookup, setup_dockerfile_commands for pip, and indent=2 on eval_results.json; retry and error-detection logic is mostly correct but the is_image_build_error indicators are broad enough to silently absorb non-transient failures, and a malformed JSON mapping file would repeatedly re-trigger parse errors on every instance. |
| helper_code/instance_to_tag_mapping.json | New precomputed instance_id → Docker Hub tag mapping with 731 entries; data appears well-formed and the code treats this as an optional hint with a heuristic fallback. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[eval_with_modal called] --> B[get_dockerhub_image_uri]
    B --> C{uid in precomputed map?}
    C -- Yes --> D[return mapping URI]
    C -- No --> E[heuristic get_dockerhub_image_uri]
    D & E --> F[Start retry loop\nattempt 1..MAX_BUILD_RETRIES]
    F --> G[modal.App.lookup]
    G --> H[modal.Image.from_registry\n+ setup_dockerfile_commands]
    H --> I[modal.Sandbox.create]
    I -- success --> J[break — sandbox ready]
    I -- exception --> K{is_image_build_error?}
    K -- Yes AND attempts remain --> L[sleep BUILD_RETRY_DELAY\ncontinue loop]
    L --> F
    K -- No OR exhausted --> M[return None]
    J --> N[sandbox.exec entryscript]
    N --> O[collect outputs]
    O --> P[return result]
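The first branch of the flowchart (precomputed map first, heuristic second) can be sketched as below. The function name `resolve_image_uri` is hypothetical, and mapped values are treated as opaque strings returned unchanged; how the real script turns a tag into a full URI is not shown in this PR text.

```python
from typing import Callable, Mapping


def resolve_image_uri(
    uid: str,
    tag_map: Mapping[str, str],
    heuristic: Callable[[str], str],
) -> str:
    """Prefer the precomputed mapping; fall back to the heuristic generator."""
    mapped = tag_map.get(uid)
    if mapped:
        return mapped
    # Missing (or empty) entry: behave exactly as before the mapping existed.
    return heuristic(uid)


# Hypothetical usage with made-up values:
#   resolve_image_uri("some-instance", {"some-instance": "org/image:tag"}, old_heuristic)
```

An empty mapping makes every lookup fall through, which is why running without the JSON file leaves existing callers unaffected.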

Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
swe_bench_pro_eval.py:331-341
**Overly-broad error indicators may swallow legitimate failures**

`"failed with the exception"` and `"registry"` are generic enough to match a wide class of non-transient errors — for example, an auth failure like `"401 Unauthorized: access to registry denied"` or any Modal exception whose message contains "failed with the exception". These will be silently retried three times and then discarded as `None`, making root-cause debugging significantly harder. Consider narrowing the indicators to known-transient patterns (e.g., `"skopeo copy"`, `"image pull"`, `"remoteerror"`) and letting everything else propagate or surface as a distinct failure mode.

### Issue 2 of 3
swe_bench_pro_eval.py:77-88
**Malformed JSON permanently re-loads on every call**

If `instance_to_tag_mapping.json` exists but is malformed, `json.load(f)` raises `JSONDecodeError` before the assignment `_INSTANCE_TAG_MAP = json.load(f)` completes, leaving `_INSTANCE_TAG_MAP` as `None`. Every subsequent call re-opens and re-attempts to parse the same broken file. Caching the failure (e.g., setting `_INSTANCE_TAG_MAP = {}` in an except block and logging a warning) would fall through to the heuristic cleanly for all remaining instances rather than printing repeated parse errors for each one.

### Issue 3 of 3
swe_bench_pro_eval.py:370-373
`modal.App.lookup` is called on every retry iteration even though the result is always the same app object. Moving it outside the loop saves redundant API round-trips on retried attempts.

```suggestion
        app = modal.App.lookup(name="swe-bench-pro-eval", create_if_missing=True)
        for attempt in range(1, MAX_BUILD_RETRIES + 1):
            try:
                image = modal.Image.from_registry(
```

Reviews (1): Last reviewed commit: "Add build-retry, portability, and precom..."

Greptile also left 2 inline comments on this PR.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread swe_bench_pro_eval.py
Comment on lines +331 to +341
indicators = (
    "image build",
    "skopeo copy",
    "failed with the exception",
    "remoteerror",
    "image pull",
    "registry",
)
if any(ind in error_str or ind in error_repr for ind in indicators):
    return True
return "RemoteError" in type(error).__name__

P2 Overly-broad error indicators may swallow legitimate failures

"failed with the exception" and "registry" are generic enough to match a wide class of non-transient errors — for example, an auth failure like "401 Unauthorized: access to registry denied" or any Modal exception whose message contains "failed with the exception". These will be silently retried three times and then discarded as None, making root-cause debugging significantly harder. Consider narrowing the indicators to known-transient patterns (e.g., "skopeo copy", "image pull", "remoteerror") and letting everything else propagate or surface as a distinct failure mode.
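One possible shape for the narrowing suggested in this comment, as a sketch only: it keeps just the three patterns the comment names as known-transient and assumes the original compares against lowercased text (the lowercase `"remoteerror"` indicator suggests so, but the excerpt does not show it). The constant name `TRANSIENT_INDICATORS` is hypothetical.

```python
# Patterns the review comment lists as known-transient.
TRANSIENT_INDICATORS = ("skopeo copy", "image pull", "remoteerror")


def is_image_build_error(error: BaseException) -> bool:
    """Return True only for errors matching a known-transient pattern.

    The repr() of an exception includes its class name, so a class called
    RemoteError is matched by the "remoteerror" indicator as well.
    """
    text = f"{error} {error!r}".lower()
    return any(indicator in text for indicator in TRANSIENT_INDICATORS)
```

With this list, a message such as "401 Unauthorized: access to registry denied" is no longer classified as retryable, so the caller can surface it as a distinct failure.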

Comment thread swe_bench_pro_eval.py
Comment on lines +77 to +88
def _load_instance_tag_map():
    """Load the instance-to-tag mapping from disk (once). Returns {} if file is absent."""
    global _INSTANCE_TAG_MAP
    if _INSTANCE_TAG_MAP is not None:
        return _INSTANCE_TAG_MAP
    if not os.path.exists(_TAG_MAP_PATH):
        _INSTANCE_TAG_MAP = {}
        return _INSTANCE_TAG_MAP
    with open(_TAG_MAP_PATH, "r") as f:
        _INSTANCE_TAG_MAP = json.load(f)
    print(f"Loaded {len(_INSTANCE_TAG_MAP)} instance-to-tag mappings from {_TAG_MAP_PATH}")
    return _INSTANCE_TAG_MAP

P2 Malformed JSON permanently re-loads on every call

If instance_to_tag_mapping.json exists but is malformed, json.load(f) raises JSONDecodeError before the assignment _INSTANCE_TAG_MAP = json.load(f) completes, leaving _INSTANCE_TAG_MAP as None. Every subsequent call re-opens and re-attempts to parse the same broken file. Caching the failure (e.g., setting _INSTANCE_TAG_MAP = {} in an except block and logging a warning) would fall through to the heuristic cleanly for all remaining instances rather than printing repeated parse errors for each one.
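A minimal sketch of the caching fix described in this comment. To keep it self-contained it takes the path as a parameter and caches per path in a dict, rather than using the module-level _INSTANCE_TAG_MAP global from the PR; the names `load_instance_tag_map` and `_TAG_MAP_CACHE` are hypothetical, and the extra check that the file holds a JSON object is an added assumption.

```python
import json
import os

_TAG_MAP_CACHE: dict = {}  # path -> parsed mapping (or {} after a failure)


def load_instance_tag_map(path: str) -> dict:
    """Load the mapping once per path; cache {} if the file is absent or unusable."""
    if path in _TAG_MAP_CACHE:
        return _TAG_MAP_CACHE[path]

    loaded: dict = {}
    if os.path.exists(path):
        try:
            with open(path, "r") as f:
                data = json.load(f)
            if isinstance(data, dict):
                loaded = data
            else:
                print(f"Warning: {path} does not contain a JSON object; using heuristic")
        except (OSError, json.JSONDecodeError) as exc:
            # Parse or read failure: warn once, then fall through to the heuristic.
            print(f"Warning: could not load {path} ({exc}); using heuristic")

    _TAG_MAP_CACHE[path] = loaded
    return loaded
```

Because the empty dict is cached on failure, the broken file is opened and the warning is printed only once per run.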
