
op12no2/patchwork


Patchwork

An informal, cumulative, and competitive frontier model eval using a JavaScript chess engine.

Procedure

Assume A is the current leading engine (initially 0000_original). A model/CLI pair is selected to improve it by creating a new engine B from prompt.md at maximum effort. If B passes an SPRT (sequential probability ratio test) against A, B becomes the new leader and is added to the ratings list via a gauntlet against the previous engines that passed.
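
The pass/fail step can be sketched as follows. This is a minimal illustration, not the repository's actual test harness: the hypotheses (`elo0`, `elo1`), the error rates (`alpha`, `beta`), and the function names are all assumptions chosen for the example, and the log-likelihood ratio uses a common normal approximation rather than any specific tool's formula.

```javascript
// Expected score under the logistic Elo model for a given Elo difference.
function expectedScore(eloDiff) {
  return 1 / (1 + Math.pow(10, -eloDiff / 400));
}

// Approximate log-likelihood ratio of H1 (B is elo1 stronger than A)
// against H0 (B is elo0 stronger), from B's win/draw/loss counts.
function sprtLlr(wins, draws, losses, elo0, elo1) {
  const n = wins + draws + losses;
  if (n === 0) return 0;
  const score = (wins + 0.5 * draws) / n;
  // Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
  const variance = (wins + 0.25 * draws) / n - score * score;
  if (variance <= 0) return 0; // all results identical: no usable estimate yet
  const s0 = expectedScore(elo0);
  const s1 = expectedScore(elo1);
  return (n * (s1 - s0) * (2 * score - s0 - s1)) / (2 * variance);
}

// Wald stopping rule. The defaults below are hypothetical, not the
// settings used for the results in this README.
function sprtDecision(wins, draws, losses, opts = {}) {
  const { elo0 = 0, elo1 = 10, alpha = 0.05, beta = 0.05 } = opts;
  const lower = Math.log(beta / (1 - alpha));
  const upper = Math.log((1 - beta) / alpha);
  const llr = sprtLlr(wins, draws, losses, elo0, elo1);
  if (llr >= upper) return { llr, verdict: "pass" };
  if (llr <= lower) return { llr, verdict: "fail" };
  return { llr, verdict: "continue" };
}
```

Under these assumptions, a match keeps running while the verdict is `"continue"`; B would replace A as leader only on `"pass"`.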

Ratings

| Rank | Engine | Elo | Games | Score | Draws |
|---|---|---|---|---|---|
| 1 | 0010_fable_5 | 2241 ±24.4 | 1600 | 74.8% | 27.0% |
| 2 | 0009_opus_4_8 | 2179 ±23.9 | 1600 | 67.1% | 27.7% |
| 3 | 0008_opus_4_8 | 2157 ±24.0 | 1600 | 64.2% | 27.6% |
| 4 | 0007_opus_4_7 | 2137 ±24.1 | 1600 | 61.5% | 26.6% |
| 5 | 0006_gpt_5_5 | 2042 ±23.7 | 1600 | 48.1% | 24.7% |
| 6 | 0005_opus_4_7 | 2014 ±22.9 | 1600 | 44.1% | 25.2% |
| 7 | 0003_opus_4_7 | 2003 ±22.7 | 1600 | 42.6% | 25.0% |
| 8 | 0002_sonnet_4_6 | 1905 ±23.2 | 1600 | 29.7% | 18.7% |
| 9 | 0000_original | 1800 | 1600 | 18.0% | 10.5% |
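
To help read the Elo column, the sketch below converts between an Elo difference and an expected head-to-head score using the standard logistic Elo model. The function names are hypothetical, and the README does not say which rating tool or model produced the table, so treat this as an interpretation aid only. The Score column is measured against the whole gauntlet pool, so it will not match these pairwise figures.

```javascript
// Expected score of the stronger side for a given Elo difference
// (logistic Elo model: 400 points corresponds to 10:1 odds).
function eloExpectedScore(eloDiff) {
  return 1 / (1 + 10 ** (-eloDiff / 400));
}

// Inverse: Elo difference implied by a score strictly between 0 and 1.
function eloDiffFromScore(score) {
  if (!(score > 0 && score < 1)) {
    throw new RangeError("score must be strictly between 0 and 1");
  }
  return -400 * Math.log10(1 / score - 1);
}

// Example: 0010_fable_5 (2241) against 0009_opus_4_8 (2179),
// a 62-point gap, gives an expected score of roughly 0.59.
console.log(eloExpectedScore(2241 - 2179).toFixed(2));
```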

SPRT

| Engine | Model | CLI | SPRT |
|---|---|---|---|
| 0011_grok_4_3 (diff) | xAI Grok 4.3 | Grok Build Beta | |
| 0010_fable_5 (diff) | Anthropic Claude Fable 5 | Claude Code | |
| 0009_opus_4_8 (diff) | Anthropic Claude Opus 4.8 | Claude Code | |
| 0008_opus_4_8 (diff) | Anthropic Claude Opus 4.8 | Claude Code | |
| 0007_opus_4_7 (diff) | Anthropic Claude Opus 4.7 | Claude Code | |
| 0006_gpt_5_5 (diff) | OpenAI GPT 5.5 | Codex | |
| 0005_opus_4_7 (diff) | Anthropic Claude Opus 4.7 | Claude Code | |
| 0004_gpt_5_5 (diff) | OpenAI GPT 5.5 | Codex | |
| 0003_opus_4_7 (diff) | Anthropic Claude Opus 4.7 | Claude Code | |
| 0002_sonnet_4_6 (diff) | Anthropic Claude Sonnet 4.6 | Claude Code | |
| 0001_haiku_4_5 (diff) | Anthropic Claude Haiku 4.5 | Claude Code | |
| 0000_original | | | |

Acknowledgements
