Add build-retry, portability, and precomputed image-tag map #103
jhmblundin wants to merge 1 commit into
This change improves resilience and ergonomics of swe_bench_pro_eval.py
without altering the existing eval_results.json schema.
What this changes
- Retry sandbox creation on transient Modal image-build / registry pull
errors (configurable via MAX_BUILD_RETRIES, BUILD_RETRY_DELAY). Prior
behavior: a single flaky pull would fail the instance permanently and
print "RemoteError" for that instance with no recovery.
- Resolve dockerfile paths against the script's own directory via
_REPO_ROOT, so the script can be invoked from any working directory
instead of only the repo root.
- Add helper_code/instance_to_tag_mapping.json, a precomputed
instance_id -> Docker Hub tag mapping (731 entries) used in preference
to the heuristic tag generator. Avoids edge-case mismatches such as
element-hq__element-web's -vnan suffix handling. Falls back to the
existing helper_code.image_uri heuristic if the mapping file is
absent, so callers running without it are unaffected.
- Pass setup_dockerfile_commands to modal.Image.from_registry to install
pip and requests, on a best-effort basis, on base images that lack them
so parser.py can import third-party deps. The commands are a silent
no-op on images where pip is already present.
- Pretty-print eval_results.json with indent=2 for human readability.
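The path-resolution and tag-lookup bullets above can be illustrated with a minimal sketch. The helper names `resolve_from_root` and `pick_image_tag` are hypothetical and do not come from the PR; the sketch also assumes `_REPO_ROOT` is derived from the script's own location (for example `os.path.dirname(os.path.abspath(__file__))`), which the description implies but does not show.

```python
import os
from typing import Callable, Mapping


def resolve_from_root(repo_root: str, relative_path: str) -> str:
    """Join a repo-relative path onto a fixed root, ignoring the current working directory."""
    return os.path.normpath(os.path.join(repo_root, relative_path))


def pick_image_tag(
    instance_id: str,
    tag_map: Mapping[str, str],
    heuristic: Callable[[str], str],
) -> str:
    """Prefer the precomputed tag; fall back to the heuristic when the id is not mapped."""
    tag = tag_map.get(instance_id)
    return tag if tag is not None else heuristic(instance_id)
```

With an empty `tag_map` (the case where the mapping file is absent), every lookup goes through `heuristic`, which matches the fallback behavior the description promises.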
Backwards compatibility
- eval_results.json schema is unchanged (still {instance_id: bool}).
- Existing call sites of helper_code.image_uri.get_dockerhub_image_uri
are still supported via the fallback path.
- No new required dependencies.
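As a quick check of the schema claim, the sketch below shows that `indent=2` changes only whitespace, so a consumer that parses the file sees the same `{instance_id: bool}` mapping. The instance ids are made up for illustration.

```python
import json

# Hypothetical results in the {instance_id: bool} shape described above.
results = {"instance_a": True, "instance_b": False}

compact = json.dumps(results)           # single-line form
pretty = json.dumps(results, indent=2)  # human-readable form

# Both serializations parse back to the same mapping.
same_after_parse = json.loads(compact) == json.loads(pretty) == results
```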
Co-authored-by: Cursor <cursoragent@cursor.com>
    indicators = (
        "image build",
        "skopeo copy",
        "failed with the exception",
        "remoteerror",
        "image pull",
        "registry",
    )
    if any(ind in error_str or ind in error_repr for ind in indicators):
        return True
    return "RemoteError" in type(error).__name__
Overly-broad error indicators may swallow legitimate failures
"failed with the exception" and "registry" are generic enough to match a wide class of non-transient errors — for example, an auth failure like "401 Unauthorized: access to registry denied" or any Modal exception whose message contains "failed with the exception". These will be silently retried three times and then discarded as None, making root-cause debugging significantly harder. Consider narrowing the indicators to known-transient patterns (e.g., "skopeo copy", "image pull", "remoteerror") and letting everything else propagate or surface as a distinct failure mode.
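One way to apply the narrowing suggested in the review comment is sketched below. The function name `is_transient_build_error` and the constant `TRANSIENT_INDICATORS` are hypothetical, and the indicator list is simply the reviewer's three examples; whether those patterns really are always transient is an assumption that would need checking against actual Modal error messages.

```python
# Assumed-transient patterns, taken from the reviewer's examples.
TRANSIENT_INDICATORS = ("skopeo copy", "image pull", "remoteerror")


def is_transient_build_error(error: Exception) -> bool:
    """Return True only when the error text or repr matches a known-transient pattern."""
    # repr() includes the exception class name, so a class called RemoteError
    # matches the "remoteerror" indicator after lowercasing.
    text = f"{error} {error!r}".lower()
    return any(indicator in text for indicator in TRANSIENT_INDICATORS)
```

Under this sketch, a message such as "401 Unauthorized: access to registry denied" is no longer classified as transient, so the caller can surface it instead of retrying.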
    def _load_instance_tag_map():
        """Load the instance-to-tag mapping from disk (once). Returns {} if file is absent."""
        global _INSTANCE_TAG_MAP
        if _INSTANCE_TAG_MAP is not None:
            return _INSTANCE_TAG_MAP
        if not os.path.exists(_TAG_MAP_PATH):
            _INSTANCE_TAG_MAP = {}
            return _INSTANCE_TAG_MAP
        with open(_TAG_MAP_PATH, "r") as f:
            _INSTANCE_TAG_MAP = json.load(f)
        print(f"Loaded {len(_INSTANCE_TAG_MAP)} instance-to-tag mappings from {_TAG_MAP_PATH}")
        return _INSTANCE_TAG_MAP
Malformed JSON permanently re-loads on every call
If instance_to_tag_mapping.json exists but is malformed, json.load(f) raises JSONDecodeError before the assignment _INSTANCE_TAG_MAP = json.load(f) completes, leaving _INSTANCE_TAG_MAP as None. Every subsequent call re-opens and re-attempts to parse the same broken file. Caching the failure (e.g., setting _INSTANCE_TAG_MAP = {} in an except block and logging a warning) would fall through to the heuristic cleanly for all remaining instances rather than printing repeated parse errors for each one.
Greptile Summary
This PR improves resilience and ergonomics of swe_bench_pro_eval.py by adding a configurable build-retry loop around Modal sandbox creation, resolving Dockerfile paths via _REPO_ROOT so the script works from any directory, and introducing a precomputed instance_to_tag_mapping.json that takes precedence over the existing heuristic URI generator.
- Wraps modal.Image.from_registry + modal.Sandbox.create in a for loop (up to MAX_BUILD_RETRIES = 3) and uses is_image_build_error to distinguish transient failures from hard failures; also adds setup_dockerfile_commands to best-effort install pip + requests on minimal base images.
- _load_instance_tag_map lazily loads 731 precomputed instance_id → Docker Hub tag entries with a fallback to the heuristic, and eval_results.json is now written with indent=2 for readability.
Confidence Score: 4/5
Safe to merge — the retry and tag-mapping paths degrade gracefully, and the path-portability fix is strictly additive.
The retry loop correctly handles all exit paths (break on success, return None on exhausted/non-transient failures), and the tag-map fallback to the heuristic means a missing JSON file is benign. Two things are worth watching: is_image_build_error uses broad indicators like 'registry' and 'failed with the exception' that could silently absorb non-transient errors (auth failures, quota errors, config mistakes), and a malformed instance_to_tag_mapping.json will cause _load_instance_tag_map to raise JSONDecodeError on every instance call rather than caching a safe empty-dict fallback.
swe_bench_pro_eval.py — specifically the is_image_build_error indicator list and the _load_instance_tag_map error path.
Flowchart
    %%{init: {'theme': 'neutral'}}%%
    flowchart TD
        A[eval_with_modal called] --> B[get_dockerhub_image_uri]
        B --> C{uid in precomputed map?}
        C -- Yes --> D[return mapping URI]
        C -- No --> E[heuristic get_dockerhub_image_uri]
        D & E --> F[Start retry loop\nattempt 1..MAX_BUILD_RETRIES]
        F --> G[modal.App.lookup]
        G --> H[modal.Image.from_registry\n+ setup_dockerfile_commands]
        H --> I[modal.Sandbox.create]
        I -- success --> J[break: sandbox ready]
        I -- exception --> K{is_image_build_error?}
        K -- Yes AND attempts remain --> L[sleep BUILD_RETRY_DELAY\ncontinue loop]
        L --> F
        K -- No OR exhausted --> M[return None]
        J --> N[sandbox.exec entryscript]
        N --> O[collect outputs]
        O --> P[return result]
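The retry branch of the flowchart can be sketched in plain Python without any Modal calls. Everything here is illustrative: `create_with_retry`, its `create` and `is_transient` callables, and the 5-second value for `BUILD_RETRY_DELAY` are assumptions, since the page shows only `MAX_BUILD_RETRIES = 3` and not the delay or the loop body.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

MAX_BUILD_RETRIES = 3     # value quoted in the summary
BUILD_RETRY_DELAY = 5.0   # seconds; assumed, the actual default is not shown


def create_with_retry(
    create: Callable[[], T],
    is_transient: Callable[[Exception], bool],
    max_retries: int = MAX_BUILD_RETRIES,
    delay: float = BUILD_RETRY_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call create() until it succeeds, retrying only errors classified as transient."""
    for attempt in range(1, max_retries + 1):
        try:
            return create()  # success ends the loop
        except Exception as error:
            if not is_transient(error) or attempt == max_retries:
                # Non-transient or exhausted: report and give up with None.
                print(f"attempt {attempt}/{max_retries} failed: {error!r}")
                return None
            sleep(delay)  # transient and attempts remain: wait, then retry
    return None
```

Injecting `sleep` keeps the sketch testable without real delays; returning `None` mirrors the flowchart, although raising a distinct exception for non-transient errors would address the debugging concern raised in the review.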
Reviews (1): Last reviewed commit: "Add build-retry, portability, and precom..."