ForgeJudge
Open evaluation leaderboard and CI gate for autonomous coding agents with sandboxed execution and public traces.
What it does
ForgeJudge is an open-source evaluation platform for autonomous coding agents. It runs every patch in an isolated sandbox, grades the result with a deterministic SWE-bench-based harness against a curated golden test set, and publishes the full OpenTelemetry trace of each run. A multi-seed regression gate flags performance regressions between agent versions, so teams building LLM-powered coding tools can use ForgeJudge as a CI gate.
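The listing does not say how the multi-seed regression gate is computed. As a rough illustration of the general idea only, the sketch below compares mean pass rates over the seeds two agent versions have in common and fails the gate when the candidate drops by more than a tolerance. The function name `regression_gate`, the seed-to-pass-rate dictionaries, and the 0.02 default tolerance are all assumptions made for this example, not ForgeJudge's actual API or thresholds.

```python
from statistics import mean


def regression_gate(baseline, candidate, tolerance=0.02):
    """Return True when the candidate version passes the gate.

    baseline, candidate: dicts mapping a seed to a pass rate in [0, 1].
    tolerance: largest allowed drop in mean pass rate (hypothetical default).
    """
    # Compare only the seeds that both versions were evaluated on.
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no seeds in common between the two runs")

    base_mean = mean(baseline[s] for s in shared)
    cand_mean = mean(candidate[s] for s in shared)

    # Pass when the candidate's mean pass rate has not dropped
    # by more than the tolerance relative to the baseline.
    return (base_mean - cand_mean) <= tolerance


if __name__ == "__main__":
    old = {0: 0.5, 1: 0.5, 2: 0.5}
    new = {0: 0.25, 1: 0.25, 2: 0.25}
    # A CI job could exit non-zero when the gate fails.
    raise SystemExit(0 if regression_gate(old, new) else 1)
```

Averaging over several seeds is what keeps a single unlucky run from failing the gate; a real gate might use a statistical test rather than a fixed tolerance.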
Capabilities
Server
Quality
Deterministic score 0.60 from registry signals: indexed on pulsemcp; has source repo; registry-generated description present.