Add build-retry, portability, and precomputed image-tag map by jhmblundin · Pull Request #103 · scaleapi/SWE-bench_Pro-os

jhmblundin · 2026-06-03T00:59:29Z

This change improves the resilience and ergonomics of swe_bench_pro_eval.py.

What this changes

- Retry sandbox creation on transient Modal image-build and registry-pull errors (configurable via MAX_BUILD_RETRIES and BUILD_RETRY_DELAY). Previously, a single flaky pull failed the instance permanently and printed "RemoteError" for it with no recovery.
- Resolve Dockerfile paths against the script's own directory via _REPO_ROOT, so the script can be invoked from any working directory instead of only the repo root.
- Add helper_code/instance_to_tag_mapping.json, a precomputed instance_id -> Docker Hub tag mapping (731 entries) that is used in preference to the heuristic tag generator. This avoids edge-case mismatches such as element-hq__element-web's -vnan suffix handling. If the mapping file is absent, the lookup falls back to the existing helper_code.image_uri heuristic, so callers running without it are unaffected.
- Pass setup_dockerfile_commands to modal.Image.from_registry to install pip and requests, on a best-effort basis, on base images that lack them so that parser.py can import third-party dependencies. The commands are a silent no-op on images where pip is already present.
- Pretty-print eval_results.json with indent=2 for human readability.
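
As a rough illustration of the retry behaviour described in the first item, the sketch below retries a sandbox factory only when the failure is classified as transient. It is not the PR's code: `create_with_retry`, `create_sandbox`, and `is_transient` are hypothetical names, and the delay value is an assumption (only MAX_BUILD_RETRIES = 3 is quoted in the review).

```python
import time

MAX_BUILD_RETRIES = 3   # value quoted in the review summary
BUILD_RETRY_DELAY = 5   # seconds; assumed, the PR's actual default is not shown here


def create_with_retry(create_sandbox, is_transient,
                      retries=MAX_BUILD_RETRIES, delay=BUILD_RETRY_DELAY):
    """Call create_sandbox() up to `retries` times.

    Returns the sandbox on success, or None when the error is not
    transient or the attempts are exhausted.
    """
    for attempt in range(1, retries + 1):
        try:
            return create_sandbox()
        except Exception as error:
            if not is_transient(error) or attempt == retries:
                print(f"Giving up after attempt {attempt}/{retries}: {error!r}")
                return None
            print(f"Transient failure on attempt {attempt}/{retries}; "
                  f"retrying in {delay}s")
            time.sleep(delay)
    return None
```

A caller would pass a zero-argument function that builds the image and creates the sandbox, plus a predicate such as the PR's is_image_build_error.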

Backwards compatibility

- eval_results.json schema is unchanged (still {instance_id: bool}).
- Existing call sites of helper_code.image_uri.get_dockerhub_image_uri are still supported via the fallback path.
- No new required dependencies.
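
The schema claim can be checked with a small round trip: indent=2 changes only the whitespace of the serialized file, not the parsed value. The instance ids below are made up for illustration.

```python
import json

# Hypothetical results in the {instance_id: bool} shape described above.
results = {"instance_a": True, "instance_b": False}

compact = json.dumps(results)
pretty = json.dumps(results, indent=2)

print(pretty)

# Both serializations parse back to the same mapping.
assert json.loads(pretty) == json.loads(compact) == results
```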

Greptile Summary

This PR improves resilience and ergonomics of swe_bench_pro_eval.py by adding a configurable build-retry loop around Modal sandbox creation, resolving Dockerfile paths via _REPO_ROOT so the script works from any directory, and introducing a precomputed instance_to_tag_mapping.json that takes precedence over the existing heuristic URI generator.

- Retry logic: wraps modal.Image.from_registry + modal.Sandbox.create in a for loop (up to MAX_BUILD_RETRIES = 3) and uses is_image_build_error to distinguish transient failures from hard failures. It also adds setup_dockerfile_commands to install pip and requests on minimal base images on a best-effort basis.
- Tag mapping: _load_instance_tag_map lazily loads 731 precomputed instance_id → Docker Hub tag entries, with a fallback to the heuristic. eval_results.json is now written with indent=2 for readability.
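
The precedence between the precomputed map and the heuristic might look roughly like the following. This is a sketch under assumptions: `resolve_tag` and `heuristic_tag` are hypothetical names, and the real get_dockerhub_image_uri signature is not shown on this page.

```python
def resolve_tag(instance_id, tag_map, heuristic_tag):
    """Prefer the precomputed tag; fall back to the heuristic for unmapped ids."""
    tag = tag_map.get(instance_id)
    if tag:
        return tag
    return heuristic_tag(instance_id)


# Hypothetical data, for illustration only.
example_map = {"org__repo-123": "org.repo-123"}


def example_heuristic(instance_id):
    # Stand-in for the existing heuristic tag generator.
    return instance_id.replace("__", ".").lower()


print(resolve_tag("org__repo-123", example_map, example_heuristic))
print(resolve_tag("Other__Repo-9", example_map, example_heuristic))
```

Passing an empty dict as the map reproduces the pre-PR behaviour, which is why an absent mapping file is harmless.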

Confidence Score: 4/5

Safe to merge — the retry and tag-mapping paths degrade gracefully, and the path-portability fix is strictly additive.

The retry loop correctly handles all exit paths (break on success, return None on exhausted or non-transient failures), and the tag-map fallback to the heuristic means a missing JSON file is benign. Two things are worth watching. First, is_image_build_error uses broad indicators like 'registry' and 'failed with the exception' that could silently absorb non-transient errors (auth failures, quota errors, config mistakes). Second, a malformed instance_to_tag_mapping.json will cause _load_instance_tag_map to raise JSONDecodeError on every instance call rather than caching a safe empty-dict fallback.

swe_bench_pro_eval.py — specifically the is_image_build_error indicator list and the _load_instance_tag_map error path.

Important Files Changed

| Filename | Overview |
| --- | --- |
| swe_bench_pro_eval.py | Adds build-retry loop, _REPO_ROOT portability fix, precomputed tag-map lookup, setup_dockerfile_commands for pip, and indent=2 on eval_results.json; retry and error-detection logic is mostly correct but the is_image_build_error indicators are broad enough to silently absorb non-transient failures, and a malformed JSON mapping file would repeatedly re-trigger parse errors on every instance. |
| helper_code/instance_to_tag_mapping.json | New precomputed instance_id → Docker Hub tag mapping with 731 entries; data appears well-formed and the code treats this as an optional hint with a heuristic fallback. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[eval_with_modal called] --> B[get_dockerhub_image_uri]
    B --> C{uid in precomputed map?}
    C -- Yes --> D[return mapping URI]
    C -- No --> E[heuristic get_dockerhub_image_uri]
    D & E --> F[Start retry loop\nattempt 1..MAX_BUILD_RETRIES]
    F --> G[modal.App.lookup]
    G --> H[modal.Image.from_registry\n+ setup_dockerfile_commands]
    H --> I[modal.Sandbox.create]
    I -- success --> J[break — sandbox ready]
    I -- exception --> K{is_image_build_error?}
    K -- Yes AND attempts remain --> L[sleep BUILD_RETRY_DELAY\ncontinue loop]
    L --> F
    K -- No OR exhausted --> M[return None]
    J --> N[sandbox.exec entryscript]
    N --> O[collect outputs]
    O --> P[return result]

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
swe_bench_pro_eval.py:331-341
**Overly-broad error indicators may swallow legitimate failures**

`"failed with the exception"` and `"registry"` are generic enough to match a wide class of non-transient errors — for example, an auth failure like `"401 Unauthorized: access to registry denied"` or any Modal exception whose message contains "failed with the exception". These will be silently retried three times and then discarded as `None`, making root-cause debugging significantly harder. Consider narrowing the indicators to known-transient patterns (e.g., `"skopeo copy"`, `"image pull"`, `"remoteerror"`) and letting everything else propagate or surface as a distinct failure mode.

### Issue 2 of 3
swe_bench_pro_eval.py:77-88
**Malformed JSON permanently re-loads on every call**

If `instance_to_tag_mapping.json` exists but is malformed, `json.load(f)` raises `JSONDecodeError` before the assignment `_INSTANCE_TAG_MAP = json.load(f)` completes, leaving `_INSTANCE_TAG_MAP` as `None`. Every subsequent call re-opens and re-attempts to parse the same broken file. Caching the failure (e.g., setting `_INSTANCE_TAG_MAP = {}` in an except block and logging a warning) would fall through to the heuristic cleanly for all remaining instances rather than printing repeated parse errors for each one.

### Issue 3 of 3
swe_bench_pro_eval.py:370-373
`modal.App.lookup` is called on every retry iteration even though the result is always the same app object. Moving it outside the loop saves redundant API round-trips on retried attempts.

```suggestion
        app = modal.App.lookup(name="swe-bench-pro-eval", create_if_missing=True)
        for attempt in range(1, MAX_BUILD_RETRIES + 1):
            try:
                image = modal.Image.from_registry(
```


Greptile also left 2 inline comments on this PR.

This change improves resilience and ergonomics of swe_bench_pro_eval.py without altering the existing eval_results.json schema.

What this changes

- Retry sandbox creation on transient Modal image-build / registry pull errors (configurable via MAX_BUILD_RETRIES, BUILD_RETRY_DELAY). Prior behavior: a single flaky pull would fail the instance permanently and print "RemoteError" for that instance with no recovery.
- Resolve dockerfile paths against the script's own directory via _REPO_ROOT, so the script can be invoked from any working directory instead of only the repo root.
- Add helper_code/instance_to_tag_mapping.json, a precomputed instance_id -> Docker Hub tag mapping (731 entries) used in preference to the heuristic tag generator. Avoids edge-case mismatches such as element-hq__element-web's -vnan suffix handling. Falls back to the existing helper_code.image_uri heuristic if the mapping file is absent, so callers running without it are unaffected.
- Add setup_dockerfile_commands to modal.Image.from_registry that best-effort install pip + requests on base images that lack them so parser.py can import third-party deps. Silently no-ops on images where pip is already present.
- Pretty-print eval_results.json with indent=2 for human readability.

Backwards compatibility

- eval_results.json schema is unchanged (still {instance_id: bool}).
- Existing call sites of helper_code.image_uri.get_dockerhub_image_uri are still supported via the fallback path.
- No new required dependencies.

Co-authored-by: Cursor <cursoragent@cursor.com>

greptile-apps · 2026-06-03T01:03:13Z

+    indicators = (
+        "image build",
+        "skopeo copy",
+        "failed with the exception",
+        "remoteerror",
+        "image pull",
+        "registry",
+    )
+    if any(ind in error_str or ind in error_repr for ind in indicators):
+        return True
+    return "RemoteError" in type(error).__name__


Overly-broad error indicators may swallow legitimate failures

"failed with the exception" and "registry" are generic enough to match a wide class of non-transient errors — for example, an auth failure like "401 Unauthorized: access to registry denied" or any Modal exception whose message contains "failed with the exception". These will be silently retried three times and then discarded as None, making root-cause debugging significantly harder. Consider narrowing the indicators to known-transient patterns (e.g., "skopeo copy", "image pull", "remoteerror") and letting everything else propagate or surface as a distinct failure mode.

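
One way to act on the comment about overly broad indicators is to keep only the patterns it names as known-transient and to refuse to retry anything that looks like an auth or quota failure. The sketch below is an assumption-laden illustration, not the PR's implementation: `is_transient_build_error` and the deny-list contents are hypothetical.

```python
# Patterns the review comment suggests keeping as known-transient.
TRANSIENT_INDICATORS = ("skopeo copy", "image pull", "remoteerror")

# Assumed deny-list: messages that point at auth or quota problems,
# which retrying cannot fix. The exact strings are hypothetical.
NON_TRANSIENT_INDICATORS = ("unauthorized", "denied", "quota")


def is_transient_build_error(error):
    """Return True only for errors that match a known-transient pattern."""
    # repr() includes the exception class name, so a class called
    # RemoteError matches the "remoteerror" indicator after lowercasing.
    text = f"{error} {error!r}".lower()
    if any(marker in text for marker in NON_TRANSIENT_INDICATORS):
        return False
    return any(marker in text for marker in TRANSIENT_INDICATORS)
```

Errors that return False here would then propagate or be reported as a distinct failure mode instead of being retried and collapsed into None.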

greptile-apps · 2026-06-03T01:03:14Z

+def _load_instance_tag_map():
+    """Load the instance-to-tag mapping from disk (once). Returns {} if file is absent."""
+    global _INSTANCE_TAG_MAP
+    if _INSTANCE_TAG_MAP is not None:
+        return _INSTANCE_TAG_MAP
+    if not os.path.exists(_TAG_MAP_PATH):
+        _INSTANCE_TAG_MAP = {}
+        return _INSTANCE_TAG_MAP
+    with open(_TAG_MAP_PATH, "r") as f:
+        _INSTANCE_TAG_MAP = json.load(f)
+    print(f"Loaded {len(_INSTANCE_TAG_MAP)} instance-to-tag mappings from {_TAG_MAP_PATH}")
+    return _INSTANCE_TAG_MAP


Malformed JSON permanently re-loads on every call

If instance_to_tag_mapping.json exists but is malformed, json.load(f) raises JSONDecodeError before the assignment _INSTANCE_TAG_MAP = json.load(f) completes, leaving _INSTANCE_TAG_MAP as None. Every subsequent call re-opens and re-attempts to parse the same broken file. Caching the failure (e.g., setting _INSTANCE_TAG_MAP = {} in an except block and logging a warning) would fall through to the heuristic cleanly for all remaining instances rather than printing repeated parse errors for each one.


jhmblundin wants to merge 1 commit into scaleapi:main from blitzy-showcase:upstream-contrib/build-retry-and-portability
