{"id":"77438244-03c6-4360-aa4a-4536f4a691e3","shortId":"GXTe5s","kind":"skill","title":"autoresearch","tagline":"This skill should be used when the user asks to \"run autoresearch\", \"optimize X in a loop\", \"set up autonomous experiments\", \"start autoresearch\", \"optimize X overnight\", or \"experiment loop\". Sets up and runs an autonomous experiment loop for any optimization target.","description":"# Autoresearch\n\nAutonomous experiment loop: try ideas, measure results, keep what works, discard what doesn't, never stop.\n\nWorks for any optimization target: test speed, bundle size, LLM training, build times, Lighthouse scores, binary size, latency, memory usage.\n\n## Setup\n\nIf `autoresearch.md` already exists in the working directory, **skip setup and resume the loop** — read `autoresearch.md`, `autoresearch.jsonl`, and `git log`, then continue experimenting.\n\nOtherwise:\n\n1. **Gather context**: Ask (or infer from `$ARGUMENTS` and conversation) the **Goal**, **Command** to benchmark, **Primary metric** (name + direction), **Files in scope**, and **Constraints**.\n2. **Create branch**: `git checkout -b autoresearch/<goal>-<date>` (e.g. `autoresearch/test-speed-2026-03-21`).\n3. **Read source files**: Understand the workload deeply before writing anything. Read every file in scope.\n4. **Write session files**: Create `autoresearch.md` and `autoresearch.sh` (see templates below). If constraints require correctness validation (tests must pass, types must check), also create `autoresearch.checks.sh`. Commit all.\n5. **Run baseline**: Execute the first experiment with no changes to establish the baseline metric.\n6. **Start looping**: Begin the experiment loop immediately after the baseline is logged.\n\n### `autoresearch.md`\n\nThe heart of the session. A fresh agent with no context should be able to read this file alone and run the loop effectively. Invest time making it excellent.\n\n```markdown\n# Autoresearch: <goal>\n\n## Objective\n<Specific description of what we're optimizing and the workload.>\n\n## Metrics\n- **Primary**: <name> (<unit>, lower/higher is better)\n- **Secondary**: <name>, <name>, ...\n\n## How to Run\n`./autoresearch.sh` — outputs `METRIC name=value` lines.\n\n## Files in Scope\n<Every file the agent may modify, with a brief note on what it does.>\n\n## Off Limits\n<What must NOT be touched — evaluation harness, data prep, etc.>\n\n## Constraints\n<Hard rules: tests must pass, no new deps, fixed time budget, etc.>\n\n## What's Been Tried\n<Update this section as experiments accumulate. Note key wins, dead ends,\nand architectural insights so the agent doesn't repeat failed approaches.>\n```\n\nUpdate `autoresearch.md` periodically — especially \"What's Been Tried\" — so resuming agents have full context.\n\n### `autoresearch.sh`\n\nBash script that runs the benchmark and outputs structured metrics.\n\n```bash\n#!/bin/bash\nset -euo pipefail\n\n# Pre-checks (fast, <1s — catch syntax errors early)\npython3 -c \"import ast; ast.parse(open('train.py').read())\"\n\n# Run benchmark\nuv run train.py > /tmp/autoresearch-output.log 2>&1\n\n# Extract and output metrics as METRIC lines\nval_bpb=$(grep \"^val_bpb:\" /tmp/autoresearch-output.log | awk '{print $2}')\necho \"METRIC val_bpb=$val_bpb\"\n```\n\nRules:\n- Use `set -euo pipefail`.\n- Output `METRIC name=value` lines to stdout (one per metric). The primary metric name must match what's documented in `autoresearch.md`.\n- Metric names: word chars, dots, or `µ` (e.g. `val_bpb`, `total_µs`, `bundle.size_kb`).\n- Keep the script fast — every second is multiplied by hundreds of runs.\n- For fast/noisy benchmarks (<5s), run multiple times inside the script and report the median.\n- Update the script during the loop as needed.\n\n### `autoresearch.checks.sh` (optional)\n\nBackpressure checks: tests, types, lint. **Only create when constraints require correctness validation.**\n\n```bash\n#!/bin/bash\nset -euo pipefail\npnpm test --run --reporter=dot 2>&1 | tail -50\npnpm typecheck 2>&1 | grep -i error || true\n```\n\nWhen this file exists:\n- Run it after every **passing** benchmark (exit 0).\n- If checks fail, log the experiment as `checks_failed` and revert.\n- Check execution time does NOT affect the primary metric.\n- Keep output minimal — suppress verbose progress, only show errors.\n\nWhen this file does not exist, skip checks entirely.\n\n## The Experiment Loop\n\n**LOOP FOREVER.** Never ask \"should I continue?\" — the user expects autonomous work.\n\nEach iteration:\n\n1. **Formulate hypothesis**: Based on prior results, source code understanding, and any ideas in `autoresearch.ideas.md`, choose what to try next.\n2. **Edit code**: Modify the in-scope files. Make a single, focused change per experiment.\n3. **Commit**: `git add -A && git commit -m \"<short description of what this experiment tries>\"`\n4. **Run benchmark**:\n   ```bash\n   timeout 600 ./autoresearch.sh > run.log 2>&1\n   ```\n   If the command times out or crashes, treat it as a failure.\n5. **Parse metrics**: Extract `METRIC` lines from the output:\n   ```bash\n   grep '^METRIC ' run.log\n   ```\n   If no METRIC lines found, the run crashed — read `tail -50 run.log` for the error.\n6. **Run checks** (if `autoresearch.checks.sh` exists and benchmark passed):\n   ```bash\n   timeout 300 ./autoresearch.checks.sh > checks.log 2>&1\n   ```\n7. **Evaluate and log**:\n   - **Improved** (primary metric better than best so far) → status `keep`. The commit stays.\n   - **Worse or equal** → status `discard`. Revert: stage autoresearch files first, then reset.\n   - **Crash** (benchmark failed) → status `crash`. Fix if trivial, otherwise revert and move on.\n   - **Checks failed** → status `checks_failed`. Revert.\n8. **Log to JSONL**: Append one line to `autoresearch.jsonl`:\n   ```json\n   {\"run\":1,\"commit\":\"a1b2c3d\",\"metric\":0.9979,\"metrics\":{\"val_bpb\":0.9979,\"peak_vram_mb\":45060.2},\"status\":\"keep\",\"description\":\"baseline\",\"timestamp\":1711036800000,\"confidence\":null}\n   ```\n9. **On discard/crash/checks_failed — revert code changes**:\n   ```bash\n   # Preserve autoresearch session files, revert everything else\n   git add autoresearch.jsonl autoresearch.md autoresearch.sh autoresearch.ideas.md autoresearch.checks.sh 2>/dev/null || true\n   git checkout -- .\n   git clean -fd\n   ```\n10. **Check confidence**: After 3+ runs, run the confidence script from the skill's installation directory:\n    ```bash\n    bash \"$(dirname \"$(readlink -f \"$0\")\")/scripts/confidence.sh\"\n    ```\n    Or locate it via the skill path and run it directly. Interpret the score:\n    - **>= 2.0x**: Improvement is likely real (green).\n    - **1.0-2.0x**: Above noise but marginal (yellow).\n    - **< 1.0x**: Within noise — consider re-running to confirm (red).\n11. **Update session**: Periodically update `autoresearch.md` \"What's Been Tried\" section and run the summary script to review progress.\n\nRepeat forever until interrupted.\n\n## JSONL Schema\n\nEach line in `autoresearch.jsonl` is a JSON object:\n\n| Field | Type | Description |\n|-------|------|-------------|\n| `run` | number | 1-indexed experiment count |\n| `commit` | string | Short git SHA (7 chars) |\n| `metric` | number | Primary metric value |\n| `metrics` | object | All metrics dict (primary + secondary) |\n| `status` | string | `keep`, `discard`, `crash`, or `checks_failed` |\n| `description` | string | What this experiment tried |\n| `timestamp` | number | Unix timestamp (ms) |\n| `confidence` | number or null | MAD-based confidence score (null if <3 runs) |\n\n## Resuming\n\nWhen `autoresearch.md` exists in the working directory:\n\n1. Read `autoresearch.md` for full context (objective, what's been tried, constraints).\n2. Read `autoresearch.jsonl` to reconstruct state (best metric, run count, last segment).\n3. Read `git log --oneline -20` for recent commit history.\n4. Check `autoresearch.ideas.md` if it exists — prune stale entries, experiment with promising ones.\n5. Continue the loop from where it left off. Do not re-run the baseline.\n\n## Ideas Backlog\n\nWhen you discover complex but promising optimizations you won't pursue right now, append them as bullets to `autoresearch.ideas.md`. Don't let good ideas get lost.\n\nOn resume, check this file — prune stale/tried entries, experiment with the rest. When all paths are exhausted, delete the file and write a final summary to `autoresearch.md`.\n\n## Loop Rules\n\nSee `references/loop-rules.md` for the full reference. Key rules:\n\n- **Primary metric is king.** Improved → keep. Worse/equal → discard.\n- **Simpler is better.** Remove code for equal perf = keep. Ugly complexity for tiny gain = discard.\n- **Don't thrash.** Repeatedly reverting the same idea? Try something structurally different.\n- **Think longer when stuck.** Re-read source files, reason about what the CPU/compiler/runtime is actually doing. Deep understanding beats random variation.\n- **Crashes**: fix if trivial (typo, missing import), otherwise log and move on. Don't over-invest.\n- **NEVER STOP.** The user may be away for hours. Keep going until interrupted.\n\n## User Messages During Experiments\n\nIf the user sends a message while an experiment is running, finish the current run-evaluate-log cycle first, then incorporate their feedback in the next iteration.","tags":["autoresearch","agent","skills","paulrberg","agent-skills","ai-agents"],"capabilities":["skill","source-paulrberg","skill-autoresearch","topic-agent-skills","topic-ai-agents"],"categories":["agent-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/PaulRBerg/agent-skills/autoresearch","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add PaulRBerg/agent-skills","source_repo":"https://github.com/PaulRBerg/agent-skills","install_from":"skills.sh"}},"qualityScore":"0.475","qualityRationale":"deterministic score 0.47 from registry signals: · indexed on github topic:agent-skills · 50 github stars · SKILL.md body (8,735 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-04-22T00:56:16.936Z","embedding":null,"createdAt":"2026-04-18T22:17:31.992Z","updatedAt":"2026-04-22T00:56:16.936Z","lastSeenAt":"2026-04-22T00:56:16.936Z","tsv":"'-2.0':853 '-20':1001 '-50':513,678 '/autoresearch.checks.sh':695 '/autoresearch.sh':261,639 '/bin/bash':361,501 '/dev/null':801 '/scripts/confidence.sh':830 '/tmp/autoresearch-output.log':387,402 '0':533,829 '0.9979':762,766 '1':105,389,511,517,589,642,698,758,909,972 '1.0':852,860 '10':808 '11':871 '1711036800000':776 '1s':369 '2':129,388,405,510,516,609,641,697,800,984 '2.0':845 '3':138,625,812,962,996 '300':694 '4':154,633,1006 '45060.2':770 '5':181,655,1019 '5s':467 '6':196,683 '600':638 '7':699,918 '8':747 '9':779 'a1b2c3d':760 'abl':223 'accumul':318 'actual':1150 'add':628,794 'affect':550 'agent':217,273,329,345 'alon':228 'alreadi':83 'also':176 'anyth':148 'append':751,1050 'approach':334 'architectur':325 'argument':112 'ask':10,108,578 'ast':377 'ast.parse':378 'autonom':21,36,44,585 'autoresearch':1,13,24,43,135,240,723,787 'autoresearch.checks.sh':178,486,687,799 'autoresearch.ideas.md':603,798,1008,1055 'autoresearch.jsonl':97,755,795,899,986 'autoresearch.md':82,96,159,209,336,437,796,876,966,974,1089 'autoresearch.sh':161,349,797 'autoresearch/test-speed-2026-03-21':137 'away':1180 'awk':403 'b':134 'backlog':1036 'backpressur':488 'base':592,957 'baselin':183,194,206,774,1034 'bash':350,360,500,636,664,692,785,824,825 'beat':1154 'begin':199 'benchmark':119,355,383,466,531,635,690,729 'best':708,990 'better':256,706,1110 'binari':75 'bpb':398,401,409,411,447,765 'branch':131 'brief':278 'budget':307 'build':71 'bullet':1053 'bundl':67 'bundle.size':450 'c':375 'catch':370 'chang':190,622,784 'char':441,919 'check':175,367,489,535,541,545,570,685,741,744,809,938,1007,1065 'checkout':133,804 'checks.log':696 'choos':604 'clean':806 'code':597,611,783,1112 'command':117,645 'commit':179,626,631,714,759,913,1004 'complex':1040,1118 'confid':777,810,816,951,958 'confirm':869 'consid':864 'constraint':128,166,296,496,983 'context':107,220,348,977 'continu':102,581,1020 'convers':114 'correct':168,498 'count':912,993 'cpu/compiler/runtime':1148 'crash':649,675,728,732,936,1157 'creat':130,158,177,494 'current':1204 'cycl':1209 'data':293 'dead':322 'deep':1152 'deepli':145 'delet':1080 'dep':304 'descript':243,773,906,940 'dict':929 'differ':1134 'direct':123,841 'directori':88,823,971 'dirnam':826 'discard':54,720,935,1107,1122 'discard/crash/checks_failed':781 'discov':1039 'document':435 'doesn':56,330 'dot':442,509 'e.g':136,445 'earli':373 'echo':406 'edit':610 'effect':233 'els':792 'end':323 'entir':571 'entri':1014,1070 'equal':718,1114 'error':372,520,562,682 'especi':338 'establish':192 'etc':295,308 'euo':363,415,503 'evalu':291,700,1207 'everi':150,270,456,529 'everyth':791 'excel':238 'execut':184,546 'exhaust':1079 'exist':84,525,568,688,967,1011 'exit':532 'expect':584 'experi':22,29,37,45,103,187,201,317,539,573,624,911,944,1015,1071,1190,1199 'extract':390,658 'f':828 'fail':333,536,542,730,742,745,939 'failur':654 'far':710 'fast':368,455 'fast/noisy':465 'fd':807 'feedback':1214 'field':904 'file':124,141,151,157,227,267,271,524,565,617,724,789,1067,1082,1143 'final':1086 'finish':1202 'first':186,725,1210 'fix':305,733,1158 'focus':621 'forev':576,891 'formul':590 'found':672 'fresh':216 'full':347,976,1096 'gain':1121 'gather':106 'get':1061 'git':99,132,627,630,793,803,805,916,998 'go':1184 'goal':116 'good':1059 'green':851 'grep':399,518,665 'har':292 'hard':297 'heart':211 'histori':1005 'hour':1182 'hundr':461 'hypothesi':591 'idea':48,601,1035,1060,1130 'immedi':203 'import':376,1163 'improv':703,847,1104 'in-scop':614 'incorpor':1212 'index':910 'infer':110 'insid':471 'insight':326 'instal':822 'interpret':842 'interrupt':893,1186 'invest':234,1173 'iter':588,1218 'json':756,902 'jsonl':750,894 'kb':451 'keep':51,452,554,712,772,934,1105,1116,1183 'key':320,1098 'king':1103 'last':994 'latenc':77 'left':1026 'let':1058 'lighthous':73 'like':849 'limit':285 'line':266,396,421,660,671,753,897 'lint':492 'llm':69 'locat':832 'log':100,208,537,702,748,999,1165,1208 'longer':1136 'loop':18,30,38,46,94,198,202,232,483,574,575,1022,1090 'lost':1062 'lower/higher':254 'm':632 'mad':956 'mad-bas':955 'make':236,618 'margin':858 'markdown':239 'match':432 'may':274,1178 'mb':769 'measur':49 'median':477 'memori':78 'messag':1188,1196 'metric':121,195,252,263,359,393,395,407,418,426,429,438,553,657,659,666,670,705,761,763,920,923,925,928,991,1101 'minim':556 'miss':1162 'modifi':275,612 'move':739,1167 'ms':950 'multipl':469 'multipli':459 'must':171,174,287,300,431 'name':122,264,419,430,439 'need':485 'never':58,577,1174 'new':303 'next':608,1217 'nois':856,863 'note':279,319 'null':778,954,960 'number':908,921,947,952 'object':241,903,926,978 'one':424,752,1018 'onelin':1000 'open':379 'optim':14,25,41,63,248,1043 'option':487 'otherwis':104,736,1164 'output':262,357,392,417,555,663 'over-invest':1171 'overnight':27 'pars':656 'pass':172,301,530,691 'path':837,1077 'peak':767 'per':425,623 'perf':1115 'period':337,874 'pipefail':364,416,504 'pnpm':505,514 'pre':366 'pre-check':365 'prep':294 'preserv':786 'primari':120,253,428,552,704,922,930,1100 'print':404 'prior':594 'progress':559,889 'promis':1017,1042 'prune':1012,1068 'pursu':1047 'python3':374 'random':1155 're':247,866,1031,1140 're-read':1139 're-run':865,1030 'read':95,139,149,225,381,676,973,985,997,1141 'readlink':827 'real':850 'reason':1144 'recent':1003 'reconstruct':988 'red':870 'refer':1097 'references/loop-rules.md':1093 'remov':1111 'repeat':332,890,1126 'report':475,508 'requir':167,497 'reset':727 'rest':1074 'result':50,595 'resum':92,344,964,1064 'revert':544,721,737,746,782,790,1127 'review':888 'right':1048 'rule':298,412,1091,1099 'run':12,34,182,230,260,353,382,385,463,468,507,526,634,674,684,757,813,814,839,867,883,907,963,992,1032,1201,1206 'run-evaluate-log':1205 'run.log':640,667,679 'schema':895 'scope':126,153,269,616 'score':74,844,959 'script':351,454,473,480,817,886 'second':457 'secondari':257,931 'section':315,881 'see':162,1092 'segment':995 'send':1194 'session':156,214,788,873 'set':19,31,362,414,502 'setup':80,90 'sha':917 'short':915 'show':561 'simpler':1108 'singl':620 'size':68,76 'skill':3,820,836 'skill-autoresearch' 'skip':89,569 'someth':1132 'sourc':140,596,1142 'source-paulrberg' 'specif':242 'speed':66 'stage':722 'stale':1013 'stale/tried':1069 'start':23,197 'state':989 'status':711,719,731,743,771,932 'stay':715 'stdout':423 'stop':59,1175 'string':914,933,941 'structur':358,1133 'stuck':1138 'summari':885,1087 'suppress':557 'syntax':371 'tail':512,677 'target':42,64 'templat':163 'test':65,170,299,490,506 'think':1135 'thrash':1125 'time':72,235,306,470,547,646 'timeout':637,693 'timestamp':775,946,949 'tini':1120 'topic-agent-skills' 'topic-ai-agents' 'total':448 'touch':290 'train':70 'train.py':380,386 'treat':650 'tri':47,312,342,607,880,945,982,1131 'trivial':735,1160 'true':521,802 'type':173,491,905 'typecheck':515 'typo':1161 'ugli':1117 'understand':142,598,1153 'unix':948 'updat':313,335,478,872,875 'usag':79 'use':6,413 'user':9,583,1177,1187,1193 'uv':384 'val':397,400,408,410,446,764 'valid':169,499 'valu':265,420,924 'variat':1156 'verbos':558 'via':834 'vram':768 'win':321 'within':862 'won':1045 'word':440 'work':53,60,87,586,970 'workload':144,251 'wors':716 'worse/equal':1106 'write':147,155,1084 'x':15,26,846,854,861 'yellow':859 'µ':444 'µs':449","prices":[{"id":"85be06f5-89b6-4a7e-ad09-37c3cca75894","listingId":"77438244-03c6-4360-aa4a-4536f4a691e3","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"PaulRBerg","category":"agent-skills","install_from":"skills.sh"},"createdAt":"2026-04-18T22:17:31.992Z"}],"sources":[{"listingId":"77438244-03c6-4360-aa4a-4536f4a691e3","source":"github","sourceId":"PaulRBerg/agent-skills/autoresearch","sourceUrl":"https://github.com/PaulRBerg/agent-skills/tree/main/skills/autoresearch","isPrimary":false,"firstSeenAt":"2026-04-18T22:17:31.992Z","lastSeenAt":"2026-04-22T00:56:16.936Z"}],"details":{"listingId":"77438244-03c6-4360-aa4a-4536f4a691e3","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"PaulRBerg","slug":"autoresearch","github":{"repo":"PaulRBerg/agent-skills","stars":50,"topics":["agent-skills","ai-agents"],"license":"mit","html_url":"https://github.com/PaulRBerg/agent-skills","pushed_at":"2026-04-20T16:22:56Z","description":"PRB's collection of agent skills","skill_md_sha":"c5eeede066be63068327a80f09c02632e8e50d0f","skill_md_path":"skills/autoresearch/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/PaulRBerg/agent-skills/tree/main/skills/autoresearch"},"layout":"multi","source":"github","category":"agent-skills","frontmatter":{"name":"autoresearch","description":"This skill should be used when the user asks to \"run autoresearch\", \"optimize X in a loop\", \"set up autonomous experiments\", \"start autoresearch\", \"optimize X overnight\", or \"experiment loop\". Sets up and runs an autonomous experiment loop for any optimization target."},"skills_sh_url":"https://skills.sh/PaulRBerg/agent-skills/autoresearch"},"updatedAt":"2026-04-22T00:56:16.936Z"}}