An agentified evaluation framework for testing PETSc code generation agents using A2A (Agent-to-Agent) and MCP (Model Context Protocol) standards.
This repository implements a multi-agent benchmark for evaluating code generation agents that produce PETSc (Portable, Extensible Toolkit for Scientific Computation) programs.
> **Important**
> See MOTIVATION.md for the motivation and design rationale behind this project.
Core building blocks:
- A2A Protocol: standardized agent-to-agent communication over HTTP.
- MCP Protocol: tool access for compilation and execution.
- Evaluation pipeline: gates + metrics + LLM-based quality evaluators, aggregated into a composite score and tier.
High-level flow:
- The Green Agent loads benchmark problems from `data/*.json`.
- For each problem, it asks the Purple Agent to generate PETSc code.
- It compiles and runs the returned code via MCP tools.
- It evaluates results and writes reports to `output/`.
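The flow above can be sketched as a simple per-problem loop. All helper functions here are hypothetical stand-ins for the real A2A and MCP calls; they only illustrate the shape of the pipeline, not the repository's actual code:

```python
# Minimal sketch of the Green Agent's per-problem loop.
# generate_code, compile_and_run, and evaluate are hypothetical
# stand-ins for the real A2A/MCP interactions.
import json
from pathlib import Path


def generate_code(problem: dict) -> str:
    """Stand-in for the A2A request to the Purple Agent."""
    return "/* PETSc source returned by the Purple Agent */"


def compile_and_run(code: str) -> dict:
    """Stand-in for the MCP compile/execute tools."""
    return {"compiled": True, "ran": True, "stdout": ""}


def evaluate(problem: dict, code: str, result: dict) -> dict:
    """Stand-in for gates + metrics + quality evaluation."""
    return {
        "problem": problem["problem_name"],
        "passed_gates": result["compiled"] and result["ran"],
    }


def run_benchmark(data_dir: Path) -> list[dict]:
    reports = []
    for path in sorted(data_dir.glob("*.json")):
        problem = json.loads(path.read_text())
        code = generate_code(problem)       # ask the Purple Agent via A2A
        result = compile_and_run(code)      # compile/run via MCP tools
        reports.append(evaluate(problem, code, result))
    return reports
```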
Note: Running the benchmark can consume significant LLM tokens depending on the model and number of problems.
The system consists of three components:
- **Green Agent** (assessment manager)
  - Loads benchmark problems from `data/*.json`
  - Sends each problem description to the Purple Agent via A2A
  - Compiles and runs returned code via MCP tools
  - Scores results (gates + metrics + quality) and aggregates into a composite score + tier
  - Writes reports to `output/`
- **Purple Agent** (target under test)
  - Receives a problem description via A2A
  - Uses an LLM to generate PETSc code
  - Returns:
    - a status text that includes `cli_args`
    - one or more code files
- **MCP Server** (tool provider)
  - Provides compilation and execution tools for PETSc code (used by the Green Agent)
PETSc is an ideal benchmark for evaluating LLM capabilities in scientific computing because it demands:
- Domain expertise: Numerical methods, PDEs, linear algebra, and parallel computing
- Large API surface: 1000+ functions across solvers (TS, SNES, KSP), data structures (Vec, Mat, DM), and optimizers (TAO)
- Correctness and performance: Solutions must be mathematically accurate and computationally efficient
- Parallel programming: MPI, domain decomposition, GPU acceleration (CUDA/HIP)
Unlike toy benchmarks, PETSc code generation tests whether LLMs can produce scientifically valid, performant, and maintainable solutions for real-world HPC applications. See PETSc applications for examples spanning climate modeling, CFD, astrophysics, and more.
At a high level, evaluation is organized into:
- Gates: binary pass/fail checks (e.g., compilation/execution/API usage)
- Metrics: quantitative measurements (e.g., numerical accuracy, execution time)
- Quality: LLM-based qualitative assessment (e.g., code style, algorithm choice, PETSc best practices)
> **Important**
> For full details on the evaluation design, scoring, and components, see EVALUATION_SYSTEM_SUMMARY.md.
Benchmark problems are defined as JSON files under data/. The Green Agent loads all JSON files in that directory.
Each problem file is expected to contain (at minimum):
- `problem_name`
- `problem_id`
- `problem_description`
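A minimal problem file can be built and validated programmatically; the sketch below uses an illustrative Robertson ODE entry (the field values and file name are made up, only the three required keys come from the documentation above):

```python
# Illustrative: build and validate a minimal problem file for data/.
# Field values and the file name are assumptions for this example.
import json
from pathlib import Path

REQUIRED_KEYS = {"problem_name", "problem_id", "problem_description"}

problem = {
    "problem_id": "robertson_ode",  # hypothetical id
    "problem_name": "Robertson ODE",
    "problem_description": "Solve the stiff Robertson ODE system with TS.",
}

# Ensure the minimum schema is satisfied before writing.
assert REQUIRED_KEYS <= problem.keys()
Path("data").mkdir(exist_ok=True)
Path("data/robertson_ode.json").write_text(json.dumps(problem, indent=2))
```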
Current suite (see data/ for full definitions):
- Robertson ODE
- 1D Advection
- Rosenbrock optimization
- Darcy flow
- 2D Navier–Stokes
- Vec/MPI tests
`gpu_data` contains problems that run on GPUs. Since our GitHub runners do not currently support GPUs, these problems are not included in the default setting, but they can be activated manually.
Each problem is evaluated across multiple dimensions (see config/green_agent_config.yaml for weights):
- Correctness
- Performance
- Code quality
- Algorithm choice
- PETSc best practices
- Semantic correctness
The Green Agent writes a single file to output/:
- `output/benchmark_summary.json`: overall summary + per-problem results

The Green Agent also emits A2A task artifacts (via `TaskUpdater.add_artifact`). Depending on your runner/integration, these may be downloadable from logs/UI, but they are not written to `output/` by default.
Codes are assigned to tiers based on composite scores:
- 🥇 GOLD (≥85): Excellent code quality and correctness
- 🥈 SILVER (≥70): Good code with minor issues
- 🥉 BRONZE (≥50): Functional but needs improvement
- ❌ FAIL (<50 or gate failure): Significant issues
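Under the default configuration, the aggregation amounts to a weighted sum of per-dimension scores mapped onto the tier cutoffs. A minimal sketch, using the default weights and thresholds from `config/green_agent_config.yaml` (the real logic lives in `src/metrics/`):

```python
# Sketch of composite scoring and tier assignment.
# Weights and cutoffs mirror the defaults in config/green_agent_config.yaml;
# this is illustrative, not the repository's actual implementation.
WEIGHTS = {
    "correctness": 0.35, "performance": 0.15, "code_quality": 0.15,
    "algorithm": 0.15, "petsc": 0.10, "semantic": 0.10,
}
TIERS = [("GOLD", 85), ("SILVER", 70), ("BRONZE", 50)]


def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (each on a 0-100 scale)."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)


def assign_tier(score: float, gates_passed: bool) -> str:
    if not gates_passed:  # any gate failure is an automatic FAIL
        return "FAIL"
    for tier, cutoff in TIERS:
        if score >= cutoff:
            return tier
    return "FAIL"
```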
```
├── data/                        # Benchmark problems (JSON files)
├── config/                      # Configuration files
│   ├── green_agent_config.yaml  # Green agent evaluation + scoring + LLM settings
│   └── purple_agent_config.yaml # Purple agent LLM settings
├── src/
│   ├── client_cli.py            # Sends "start benchmark" task to the Green Agent
│   ├── launcher.py              # Spawns Green/Purple/MCP locally (end-to-end)
│   ├── green_agent/             # Assessment manager agent
│   ├── purple_agent/            # Target agent under test
│   ├── evaluators/              # Gates / metrics / quality evaluators
│   ├── metrics/                 # Score aggregation + tiering
│   └── util/                    # A2A helpers + LLM client
├── main.py                      # CLI entry point (green/purple/launch)
├── pyproject.toml               # Python project configuration
├── output/                      # Generated reports and results
└── purple_agent_cache/          # Cached purple-agent responses (optional)
```
- **PETSc Installation**: Install PETSc from https://petsc.org/ for local compilation/execution.
- **Python 3.12+**: Required (see `pyproject.toml`).
- **uv**: Python package manager used by this repo: https://github.com/astral-sh/uv

Install dependencies using `uv`:

```shell
uv sync
```

Create a `.env` file in the root directory with the following variables:
```shell
# LLM API Keys
GEMINI_API_KEY="<your_gemini_key>"
OPENAI_API_KEY="<your_openai_key>"

# PETSc Configuration (required for compilation/execution)
PETSC_DIR="<path_to_petsc_installation>"
PETSC_ARCH="<petsc_architecture>"   # e.g., arch-darwin-c-debug
```

For local testing, launch the complete evaluation workflow:

```shell
uv run main.py launch
```

This command will:
- Start the Green Agent (assessment manager)
- Start the Purple Agent (code generator)
- Start the MCP server (compilation/execution tools)
- Run all benchmark problems
- Generate evaluation reports in `output/`
You can run the components separately (useful when deploying services on different machines or restarting a single component during development).
```shell
# Start only the Green Agent
uv run src/green_agent/server.py

# Start only the Purple Agent
uv run src/purple_agent/petsc_agent.py
```

For MCP server deployment, refer to https://gitlab.com/petsc/petsc_mcp_servers
Once the Green Agent, Purple Agent, and MCP server are running, trigger a benchmark run by sending the task message to the Green Agent:
```shell
uv run src/client_cli.py --green-url <GREEN_URL> --purple-url <PURPLE_URL> --mcp-server-url <MCP_URL>
```

The system uses separate configuration files for each agent:

- `config/green_agent_config.yaml`: Green agent LLM model and evaluation settings
- `config/purple_agent_config.yaml`: Purple agent LLM model settings
Example `config/green_agent_config.yaml`:

```yaml
evaluation:
  enable_gates: true          # Enable binary pass/fail checks
  enable_metrics: true        # Enable quantitative measurements
  enable_quality: true        # Enable quality assessments
  parallel_evaluation: true   # Run evaluators in parallel

llm:
  model: "openai/gpt52"       # LLM for quality evaluation
  api_base_url: "https://apps-dev.inside.anl.gov/argoapi/v1"  # Optional API base URL (e.g., Argo/AskSage)
  temperature: 0              # Set to 0 only for reproducibility
  max_concurrent_calls: 3     # Rate limiting for LLM calls

scoring:
  weights:
    correctness: 0.35         # Weight for correctness score
    performance: 0.15         # Weight for performance metrics
    code_quality: 0.15        # Weight for code quality
    algorithm: 0.15           # Weight for algorithm choice
    petsc: 0.10               # Weight for PETSc best practices
    semantic: 0.10            # Weight for semantic correctness
  tiers:
    gold: 85                  # Minimum score for GOLD tier
    silver: 70                # Minimum score for SILVER tier
    bronze: 50                # Minimum score for BRONZE tier
```

Example `config/purple_agent_config.yaml`:
```yaml
llm:
  model: "openai/claudeopus45"  # LLM for code generation
  api_base_url: "https://apps-dev.inside.anl.gov/argoapi/v1"  # Optional API base URL (e.g., Argo/AskSage)
  temperature: 0                # Set to 0 only for reproducibility
```

Note:
- Use a LiteLLM-style name, e.g. `<provider_name>/<model_name>`. For models provided with an OpenAI-compatible endpoint, use `openai` as the provider name.
- Leave `api_base_url` null to use each provider's default (e.g. `https://api.openai.com/v1`). Set it to use a custom or proxy endpoint (e.g. `https://apps-dev.inside.anl.gov/argoapi/v1` for Argo). For OpenAI-compatible APIs the URL should end with `/v1`; the client will use the appropriate LiteLLM provider prefix.
- The system auto-detects AskSage endpoints when `api_base_url` starts with `https://api.asksage.anl.gov` and configures SSL and API keys accordingly.
Evaluators live under src/evaluators/ and are wired into the pipeline in src/evaluators/pipeline.py.
To add a new evaluator:

- Create a class inheriting from `src.evaluators.base.Evaluator`
- Implement `name`, `evaluator_type`, and `evaluate(...)`
- Add the evaluator to the pipeline
Example:

```python
from src.evaluators.base import Evaluator, EvaluatorType, EvaluationResult


class MyCustomEvaluator(Evaluator):
    @property
    def name(self) -> str:
        return "my_custom_check"

    @property
    def evaluator_type(self) -> EvaluatorType:
        return EvaluatorType.QUALITY

    async def evaluate(self, code: str, problem: dict, execution_result: dict | None = None) -> EvaluationResult:
        return EvaluationResult(
            evaluator_name=self.name,
            evaluator_type=self.evaluator_type,
            quality_score=0.8,
            feedback="Custom evaluation passed",
            evaluation_method="deterministic",
            confidence=1.0,
        )
```

The Green Agent can cache Purple Agent responses (pickled per problem) to speed up development iteration. Cached responses are stored in `purple_agent_cache/`.
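The caching scheme could be implemented roughly as follows; the per-problem file naming and the cached payload shape are assumptions for illustration:

```python
# Sketch of a per-problem pickle cache like the one in purple_agent_cache/.
# The "<problem_id>.pkl" naming is an assumption, not the repo's format.
import pickle
from pathlib import Path

CACHE_DIR = Path("purple_agent_cache")


def cached_response(problem_id: str):
    """Return a cached Purple Agent response, or None on a cache miss."""
    path = CACHE_DIR / f"{problem_id}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    return None


def store_response(problem_id: str, response) -> None:
    """Pickle a Purple Agent response so later runs can skip regeneration."""
    CACHE_DIR.mkdir(exist_ok=True)
    (CACHE_DIR / f"{problem_id}.pkl").write_bytes(pickle.dumps(response))
```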
- **Wrong Python version**: This repo requires Python 3.12+ (see `pyproject.toml`).
- **PETSc not found**: Ensure `PETSC_DIR` and `PETSC_ARCH` are set correctly in `.env`.
- **LLM API/proxy errors**:
  - Verify API keys are valid and have sufficient quota.
  - If using an OpenAI-compatible proxy (Argo/AskSage), ensure `api_base_url` is set correctly in the relevant config.
  - For AskSage endpoints, ensure `ASKSAGE_API_KEY` and `ASKSAGE_SSL_CERT_FILE` are set.
- **Agent connectivity / timeouts**:
  - Confirm the Green and Purple URLs/ports match your deployment.
  - If agents are slow to start, you may need to increase timeouts in `src/util/a2a_comm.py`.
- **Port conflicts**: Modify ports in `src/launcher.py` if defaults are in use (Green 9001, Purple 9002, MCP 8080).
- **Missing output files**: Only `output/benchmark_summary.json` is written to disk by default; other reports are emitted as task artifacts.
