An informal, cumulative, and competitive frontier model eval built around a JavaScript chess engine.

Assume A is the current leading engine (initially 0000_original). A model/CLI pair is selected to improve it by creating a new engine, B, from prompt.md at maximum effort. If B passes an SPRT against A, B becomes the new leader and is added to the ratings list via a gauntlet against the earlier engines that passed.

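
The pass/fail gate described above can be sketched as a sequential probability ratio test over win/draw/loss counts. This is a minimal sketch using the common normal approximation of the log-likelihood ratio, not fastchess's actual implementation. The hypotheses `elo0`/`elo1`, the error rates `alpha`/`beta`, and all function names are assumptions chosen for illustration, since the settings used for this eval are not stated here.

```python
import math


def expected_score(elo_diff):
    """Standard logistic Elo model: expected score for a given rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) versus H0 (elo0)."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = (wins + 0.25 * draws) / n - score ** 2
    if variance <= 0.0:
        return 0.0  # Not enough mixed results to estimate a variance yet.
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return (s1 - s0) * (2.0 * score - s0 - s1) * n / (2.0 * variance)


def sprt_decision(wins, draws, losses, elo0=0.0, elo1=10.0, alpha=0.05, beta=0.05):
    """Return 'pass', 'fail', or 'continue' (all defaults are assumed values)."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= upper:
        return "pass"      # B is accepted as stronger and becomes the leader.
    if llr <= lower:
        return "fail"      # B is rejected and A stays the leader.
    return "continue"      # Keep playing games.


# Hypothetical running totals for B against A.
print(sprt_decision(wins=400, draws=300, losses=300))
```

The test is sequential: the decision is re-evaluated as games accumulate, and the match stops as soon as either bound is crossed.
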
| Rank | Engine | Elo | Games | Score | Draws |
|---|---|---|---|---|---|
| 1 | 0010_fable_5 | 2241 ±24.4 | 1600 | 74.8% | 27.0% |
| 2 | 0009_opus_4_8 | 2179 ±23.9 | 1600 | 67.1% | 27.7% |
| 3 | 0008_opus_4_8 | 2157 ±24.0 | 1600 | 64.2% | 27.6% |
| 4 | 0007_opus_4_7 | 2137 ±24.1 | 1600 | 61.5% | 26.6% |
| 5 | 0006_gpt_5_5 | 2042 ±23.7 | 1600 | 48.1% | 24.7% |
| 6 | 0005_opus_4_7 | 2014 ±22.9 | 1600 | 44.1% | 25.2% |
| 7 | 0003_opus_4_7 | 2003 ±22.7 | 1600 | 42.6% | 25.0% |
| 8 | 0002_sonnet_4_6 | 1905 ±23.2 | 1600 | 29.7% | 18.7% |
| 9 | 0000_original | 1800 | 1600 | 18.0% | 10.5% |
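
To read the Elo column in head-to-head terms, the textbook logistic Elo model converts a rating gap into an expected score. The sketch below shows only that formula. It makes no claim about how Ordo fits the ratings, and the function name is hypothetical.

```python
def expected_score(elo_diff):
    """Expected score of the higher-rated side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Rating gap between ranks 1 and 2 in the table above.
gap = 2241 - 2179
print(f"{gap} Elo gap -> expected score {expected_score(gap):.3f}")
```

The Score column appears to average over the whole gauntlet pool, so it should not be expected to match a single pairwise figure like this one.
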

| Engine | Diff | Model | CLI | SPRT |
|---|---|---|---|---|
| 0011_grok_4_3 | diff | xAI Grok 4.3 | Grok Build Beta | ✗ |
| 0010_fable_5 | diff | Anthropic Claude Fable 5 | Claude Code | ✓ |
| 0009_opus_4_8 | diff | Anthropic Claude Opus 4.8 | Claude Code | ✓ |
| 0008_opus_4_8 | diff | Anthropic Claude Opus 4.8 | Claude Code | ✓ |
| 0007_opus_4_7 | diff | Anthropic Claude Opus 4.7 | Claude Code | ✓ |
| 0006_gpt_5_5 | diff | OpenAI GPT 5.5 | Codex | ✓ |
| 0005_opus_4_7 | diff | Anthropic Claude Opus 4.7 | Claude Code | ✓ |
| 0004_gpt_5_5 | diff | OpenAI GPT 5.5 | Codex | ✗ |
| 0003_opus_4_7 | diff | Anthropic Claude Opus 4.7 | Claude Code | ✓ |
| 0002_sonnet_4_6 | diff | Anthropic Claude Sonnet 4.6 | Claude Code | ✓ |
| 0001_haiku_4_5 | diff | Anthropic Claude Haiku 4.5 | Claude Code | ✗ |
| 0000_original | | | | |

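
The cumulative leader rule can be replayed from the SPRT column of the table above. This is a rough sketch that assumes the attempts were run in the order of their numeric prefixes. The `replay` helper and the data layout are hypothetical, not part of the project's tooling.

```python
# (engine, passed SPRT) pairs copied from the table, oldest first.
ATTEMPTS = [
    ("0001_haiku_4_5", False),
    ("0002_sonnet_4_6", True),
    ("0003_opus_4_7", True),
    ("0004_gpt_5_5", False),
    ("0005_opus_4_7", True),
    ("0006_gpt_5_5", True),
    ("0007_opus_4_7", True),
    ("0008_opus_4_8", True),
    ("0009_opus_4_8", True),
    ("0010_fable_5", True),
    ("0011_grok_4_3", False),
]


def replay(initial_leader, attempts):
    """Walk the attempts in order, promoting a candidate only when it passes."""
    leader = initial_leader
    rated = [initial_leader]   # Engines that end up in the ratings list.
    history = []               # (candidate, leader it faced, passed)
    for candidate, passed in attempts:
        history.append((candidate, leader, passed))
        if passed:
            leader = candidate
            rated.append(candidate)
    return leader, rated, history


leader, rated, history = replay("0000_original", ATTEMPTS)
print("current leader:", leader)
for candidate, opponent, passed in history:
    print(f"{candidate} vs {opponent}: {'pass' if passed else 'fail'}")
```

Under that ordering assumption, a failed candidate leaves the leader unchanged, so the next model is tested against the same engine.
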
- https://github.com/Disservin/fastchess - SPRT and tournament manager
- https://github.com/michiguel/Ordo - Elo rating calculation