{"id":"10386c54-472e-439b-b138-910cd62d3a0f","shortId":"dHcrxa","kind":"skill","title":"rl-reward","tagline":"Build RL reward signals using the OpenJudge framework. Covers choosing between pointwise and pairwise reward strategies based on RL algorithm, task type, and cost; aggregating multi-dimensional pointwise scores into a scalar reward; pairwise tournament reward for GRPO on subjective tasks (net win rate across group rollouts); generating preference pairs for DPO/RLAIF; and normalizing scores for training stability.","description":"# RL Reward Construction with OpenJudge\n\nBuild reward signals for reinforcement learning from human feedback (RLHF) and\nreinforcement learning from AI feedback (RLAIF) using the `openjudge` library.\n\n## When to Use This Skill\n\n- Building scalar rewards for GRPO / REINFORCE rollout scoring\n- Generating (chosen, rejected) preference pairs for DPO / IPO\n- Best-of-N candidate selection\n- Multi-dimensional reward shaping (correctness + safety + format)\n- Replacing or bootstrapping a reward model with LLM-as-judge\n\n## Step 1 — Choose Your Reward Strategy\n\nUse this decision tree **before** writing any code:\n\n```\nRL Algorithm + Task type?\n│\n├── GRPO / REINFORCE — Verifiable task (math, code, structured output)\n│   └── → POINTWISE  ✅  (FunctionGrader, exact score, zero LLM cost)\n│\n├── GRPO / REINFORCE — Subjective task (instruction following, dialogue, summarization)\n│   └── → PAIRWISE TOURNAMENT  ✅  (compare each rollout vs all others in group,\n│                                    reward = net win rate within group)\n│\n├── DPO / IPO / SLiC — need (chosen, rejected) pairs\n│   └── → PAIRWISE  ✅  (two-way comparison, return winner/loser)\n│\n└── Best-of-N / reranking — rank N candidates\n    └── → LISTWISE  ✅  (single call ranks all N at once)\n```\n\n```\nCost constraint?\n├── Low budget\n│   └── FunctionGrader (free) → pointwise; or pairwise with small judge model\n│\n├── Medium budget\n│   └── Pointwise: 2–3 LLM graders + WeightedSumAggregator\n│   └── Pairwise tournament: 1 LLM judge, N*(N-1)/2 comparisons per group\n│\n└── High quality / no cost 
limit\n    └── Pointwise voting (3–5 calls) or pairwise with strong judge + debiasing\n```\n\n## Sub-documents — Read When Relevant\n\n| Topic | File | Read when… |\n|-------|------|------------|\n| Pointwise multi-dim reward | `pointwise.md` | GRPO on verifiable tasks; multi-dimension scoring |\n| Pairwise reward | `pairwise.md` | GRPO on subjective tasks (tournament); DPO/RLAIF preference pairs |\n\nRead the relevant sub-document **before** writing any code.\n\n## Install\n\n```bash\npip install py-openjudge\n```\n\n## Strategy Comparison\n\n| Strategy | Output | Reward signal | Typical use | Cost |\n|----------|--------|---------------|-------------|------|\n| **Pointwise** | scalar per response | direct reward `r(x, y)` | GRPO on verifiable tasks, filtering | Low–Medium |\n| **Pairwise Tournament** | net win rate per response | relative reward within group | GRPO on subjective tasks | Medium (N²/2 calls) |\n| **Pairwise** | winner/loser pair | implicit preference `y+ > y-` | DPO, IPO, RLAIF preference data | Medium |\n| **Listwise** | rank over N responses | ordinal reward / reranking | Best-of-N, reranking | Medium–High |\n\n## Score Normalization\n\nAll graders return scores on different scales. **Always normalize** before feeding into RL:\n\n```python\ndef normalize(score: float, min_score: float, max_score: float) -> float:\n    \"\"\"Map [min_score, max_score] → [0.0, 1.0].\"\"\"\n    if max_score == min_score:\n        return 0.0\n    return (score - min_score) / (max_score - min_score)\n\n# LLM graders (common/*) return 1–5 → normalize to 0–1\nreward = normalize(result.score, min_score=1, max_score=5)\n\n# FunctionGrader / text graders already return 0–1 → no normalization needed\n```\n\n## Evaluation Strategies\n\nEvaluation strategies control **how many times** a grader is called and **how\nresults are aggregated**. 
They are independent of the grader itself.\n\n### Choose Your Strategy\n\n```\nGrader type?\n│\n├── Deterministic (FunctionGrader, StringMatch, CodeExecution, etc.)\n│   └── → Direct  (zero variance, no need for aggregation)\n│\n├── LLM grader — Pointwise scoring\n│   │\n│   ├── Budget limited / speed critical\n│   │   └── → Direct  (accept variance, 1× cost)\n│   │\n│   ├── Discrete scores (1–5 integer, pass/fail, binary)\n│   │   └── → Voting  (majority vote, robust to outliers, N× cost)\n│   │\n│   └── Continuous / fine-grained scores (need precise ranking)\n│       └── → Average  (mean, preserves signal, N× cost)\n│\n└── LLM grader — Pairwise GRPO tournament\n    └── → GRPOTournament  (all-pairs comparison, net win rate)\n```\n\n| Strategy | Aggregation | Best for | Cost |\n|----------|-------------|----------|------|\n| `DirectEvaluationStrategy` | None | Deterministic graders; low budget | 1× |\n| `VotingEvaluationStrategy` | Majority vote | Discrete / integer LLM scores | N× |\n| `AverageEvaluationStrategy` | Mean | Continuous LLM scores | N× |\n| `GRPOTournamentEvaluationStrategy` | Net win rate | Pairwise GRPO on subjective tasks | N²/2× |\n\nAll strategies are imported from `openjudge.evaluation_strategy`.\n\n### Pointwise — Noise Reduction with Voting / Average\n\nFor high-variance LLM judges, wrap any grader with `VotingEvaluationStrategy`\nto run N calls and take the majority vote:\n\n```python\nfrom openjudge.evaluation_strategy import VotingEvaluationStrategy\n\ngrader = CorrectnessGrader(\n    model=model,\n    strategy=VotingEvaluationStrategy(num_votes=3, tie_breaker=\"closest_to_mean\"),\n)\n# Now each call internally runs 3 LLM evaluations and returns the most common score\n```\n\nUse odd `num_votes` (3, 5) to avoid ties.\n\n### Pairwise — GRPO Tournament\n\nFor GRPO on subjective tasks, use `GRPOTournamentEvaluationStrategy` to run\nall-pairs comparison and compute net win rate per rollout:\n\n```python\nfrom openjudge.evaluation_strategy 
import GRPOTournamentEvaluationStrategy\n\nstrategy = GRPOTournamentEvaluationStrategy(debiased=False)\nresults = await strategy.execute(\n    pairwise_grader.aevaluate,\n    query=\"Write a haiku about the ocean.\",\n    responses=[\"rollout_1\", \"rollout_2\", \"rollout_3\", \"rollout_4\"],\n)\nrewards = [r.score for r in results]  # net win rates in [-1.0, 1.0]\n```\n\nSet `debiased=True` to run each pair in both orders and only count consistent\nresults (doubles LLM calls but mitigates position bias).","tags":["reward","openjudge","agentscope-ai","agent","agent-skills","ai-agent","alignment","evaluation","grader","llm","reward-model","rlhf"],"capabilities":["skill","source-agentscope-ai","skill-rl-reward","topic-agent","topic-agent-skills","topic-ai-agent","topic-alignment","topic-evaluation","topic-grader","topic-llm","topic-reward","topic-reward-model","topic-rlhf","topic-skill-md","topic-skills"],"categories":["OpenJudge"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/agentscope-ai/OpenJudge/rl-reward","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add agentscope-ai/OpenJudge","source_repo":"https://github.com/agentscope-ai/OpenJudge","install_from":"skills.sh"}},"qualityScore":"0.700","qualityRationale":"deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 585 github stars · SKILL.md body (5,780 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-02T18:53:08.577Z","embedding":null,"createdAt":"2026-04-18T21:57:31.399Z","updatedAt":"2026-05-02T18:53:08.577Z","lastSeenAt":"2026-05-02T18:53:08.577Z","tsv":"'-1':232 '-1.0':727 '/2':233,347,587 '0':434,450 '0.0':409,417 
'1':118,227,430,435,441,451,507,511,562,710 '1.0':410,728 '2':220,712 '3':221,244,635,646,659,714 '4':716 '5':245,431,444,512,660 'accept':505 'aggreg':28,471,495,552 'ai':64 'algorithm':23,132 'all-pair':544,676 'alreadi':448 'alway':386 'averag':532,600 'averageevaluationstrategi':571 'avoid':662 'await':698 'base':20 'bash':299 'best':93,189,371,553 'best-of-n':92,188,370 'bias':750 'binari':515 'bootstrap':108 'breaker':637 'budget':207,218,500,561 'build':4,50,76 'call':198,246,348,466,615,643,746 'candid':96,195 'choos':13,119,479 'chosen':85,178 'closest':638 'code':130,140,297 'codeexecut':487 'common':428,653 'compar':160 'comparison':185,234,306,547,679 'comput':681 'consist':742 'constraint':205 'construct':47 'continu':524,573 'control':459 'correct':103 'correctnessgrad':628 'cost':27,149,204,240,313,508,523,537,555 'count':741 'cover':12 'critic':503 'data':360 'debias':252,695,730 'decis':125 'def':393 'determinist':484,558 'dialogu':156 'differ':384 'dim':266 'dimens':275 'dimension':31,100 'direct':318,489,504 'directevaluationstrategi':556 'discret':509,566 'document':255,293 'doubl':744 'dpo':90,174,356 'dpo/rlaif':285 'etc':488 'evalu':455,457,648 'exact':145 'fals':696 'feed':389 'feedback':58,65 'file':260 'filter':327 'fine':526 'fine-grain':525 'float':396,399,402,403 'follow':155 'format':105 'framework':11 'free':209 'functiongrad':144,208,445,485 'generat':84 'grader':223,380,427,447,464,477,482,497,539,559,609,627 'grain':527 'group':167,173,236,340 'grpo':42,80,135,150,269,280,323,341,541,582,665,668 'grpotourna':543 'grpotournamentevaluationstrategi':577,673,692,694 'haiku':704 'high':237,376,603 'high-vari':602 'human':57 'implicit':352 'import':591,625,691 'independ':474 'instal':298,301 'instruct':154 'integ':513,567 'intern':644 'ipo':91,175,357 'judg':116,215,229,251,606 'learn':55,62 'librari':70 'limit':241,501 'listwis':196,362 'llm':114,148,222,228,426,496,538,568,574,605,647,745 'llm-as-judg':113 'low':206,328,560 
'major':517,564,619 'mani':461 'map':404 'math':139 'max':400,407,412,422,442 'mean':533,572,640 'medium':217,329,345,361,375 'min':397,405,414,420,424,439 'mitig':748 'model':111,216,629,630 'multi':30,99,265,274 'multi-dim':264 'multi-dimens':273 'multi-dimension':29,98 'n':95,191,194,201,230,231,346,365,373,522,536,570,576,586,614 'need':177,454,493,529 'net':169,332,548,578,682,723 'nois':596 'none':557 'normal':378,387,394,432,437,453 'num':633,657 'ocean':707 'odd':656 'openjudg':10,49,69,304 'openjudge.evaluation':593,623,689 'order':738 'ordin':367 'other':165 'outlier':521 'output':142,308 'pair':88,180,287,351,546,678,735 'pairwis':17,38,158,181,212,225,248,277,330,349,540,581,664 'pairwise.md':279 'pairwise_grader.aevaluate':700 'pass/fail':514 'per':235,316,335,685 'pip':300 'pointwis':15,32,143,210,219,242,263,314,498,595 'pointwise.md':268 'posit':749 'precis':530 'prefer':87,286,353,359 'preserv':534 'py':303 'py-openjudg':302 'python':392,621,687 'qualiti':238 'queri':701 'r':320,720 'r.score':718 'rank':193,199,363,531 'rate':171,334,550,580,684,725 'read':256,261,288 'reduct':597 'reinforc':54,61,81,136,151 'reject':86,179 'relat':337 'relev':258,290 'replac':106 'rerank':192,369,374 'respons':317,336,366,708 'result':469,697,722,743 'result.score':438 'return':186,381,416,418,429,449,650 'reward':3,6,18,37,40,46,51,78,101,110,121,168,267,278,309,319,338,368,436,717 'rl':2,5,22,45,131,391 'rl-reward':1 'rlaif':66,358 'rlhf':59 'robust':519 'rollout':82,162,686,709,711,713,715 'run':613,645,675,733 'safeti':104 'scalar':36,77,315 'scale':385 'score':33,83,146,276,377,382,395,398,401,406,408,413,415,419,421,423,425,440,443,499,510,528,569,575,654 'select':97 'set':729 'shape':102 'signal':7,52,310,535 'singl':197 'skill':75 'skill-rl-reward' 'slic':176 'small':214 'source-agentscope-ai' 'speed':502 'step':117 'strategi':19,122,305,307,456,458,481,551,589,594,624,631,690,693 'strategy.execute':699 'stringmatch':486 'strong':250 'structur':141 
'sub':254,292 'sub-docu':253,291 'subject':152,282,343,584,670 'subjecti':44 'summar':157 'take':617 'task':24,133,138,153,272,283,326,344,585,671 'text':446 'tie':636,663 'time':462 'topic':259 'topic-agent' 'topic-agent-skills' 'topic-ai-agent' 'topic-alignment' 'topic-evaluation' 'topic-grader' 'topic-llm' 'topic-reward' 'topic-reward-model' 'topic-rlhf' 'topic-skill-md' 'topic-skills' 'tournament':39,159,226,284,331,542,666 'tree':126 'true':731 'two':183 'two-way':182 'type':25,134,483 'typic':311 'use':8,67,73,123,312,655,672 'varianc':491,506,604 'verifi':137,271,325 'vote':243,516,518,565,599,620,634,658 'votingevaluationstrategi':563,611,626,632 'vs':163 'way':184 'weightedsumaggreg':224 'win':170,333,549,579,683,724 'winner/loser':187,350 'within':172,339 'wrap':607 'write':128,295,702 'x':321 'y':322,354,355 'zero':147,490","prices":[{"id":"bdbce3f3-4c2a-455b-b1c8-f53dcdd765c9","listingId":"10386c54-472e-439b-b138-910cd62d3a0f","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"agentscope-ai","category":"OpenJudge","install_from":"skills.sh"},"createdAt":"2026-04-18T21:57:31.399Z"}],"sources":[{"listingId":"10386c54-472e-439b-b138-910cd62d3a0f","source":"github","sourceId":"agentscope-ai/OpenJudge/rl-reward","sourceUrl":"https://github.com/agentscope-ai/OpenJudge/tree/main/skills/rl-reward","isPrimary":false,"firstSeenAt":"2026-04-18T21:57:31.399Z","lastSeenAt":"2026-05-02T18:53:08.577Z"}],"details":{"listingId":"10386c54-472e-439b-b138-910cd62d3a0f","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"agentscope-ai","slug":"rl-reward","github":{"repo":"agentscope-ai/OpenJudge","stars":585,"topics":["agent","agent-skills","ai-agent","alignment","evaluation","grader","llm","reward","reward-model","rlhf","
skill-md","skills"],"license":"apache-2.0","html_url":"https://github.com/agentscope-ai/OpenJudge","pushed_at":"2026-04-30T08:18:46Z","description":"OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards","skill_md_sha":"d3da6f5f63cb891f6e9e465fea863e043ff59836","skill_md_path":"skills/rl-reward/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/agentscope-ai/OpenJudge/tree/main/skills/rl-reward"},"layout":"multi","source":"github","category":"OpenJudge","frontmatter":{"name":"rl-reward","description":"Build RL reward signals using the OpenJudge framework. Covers choosing between pointwise and pairwise reward strategies based on RL algorithm, task type, and cost; aggregating multi-dimensional pointwise scores into a scalar reward; pairwise tournament reward for GRPO on subjective tasks (net win rate across group rollouts); generating preference pairs for DPO/RLAIF; and normalizing scores for training stability. Use when building reward models, scoring rollouts for GRPO/REINFORCE, generating preference data for DPO, or doing Best-of-N selection."},"skills_sh_url":"https://skills.sh/agentscope-ai/OpenJudge/rl-reward"},"updatedAt":"2026-05-02T18:53:08.577Z"}}