{"id":"77438244-03c6-4360-aa4a-4536f4a691e3","shortId":"GXTe5s","kind":"skill","title":"autoresearch","tagline":"This skill should be used when the user asks to \"run autoresearch\", \"optimize X in a loop\", \"set up autonomous experiments\", \"start autoresearch\", \"optimize X overnight\", or \"experiment loop\". Sets up and runs an autonomous experiment loop for any optimization target.","description":"# Autoresearch\n\nAutonomous experiment loop: try ideas, measure results, keep what works, discard what doesn't, never stop.\n\nWorks for any optimization target: test speed, bundle size, LLM training, build times, Lighthouse scores, binary size, latency, memory usage.\n\n## Setup\n\nIf `autoresearch.md` already exists in the working directory, **skip setup and resume the loop** — read `autoresearch.md`, `autoresearch.jsonl`, and `git log`, then continue experimenting.\n\nOtherwise:\n\n1. **Gather context**: Ask (or infer from `$ARGUMENTS` and conversation) the **Goal**, **Command** to benchmark, **Primary metric** (name + direction), **Files in scope**, and **Constraints**.\n2. **Create branch**: `git checkout -b autoresearch/<goal>-<date>` (e.g. `autoresearch/test-speed-2026-03-21`).\n3. **Read source files**: Understand the workload deeply before writing anything. Read every file in scope.\n4. **Write session files**: Create `autoresearch.md` and `autoresearch.sh` (see templates below). If constraints require correctness validation (tests must pass, types must check), also create `autoresearch.checks.sh`. Commit all.\n5. **Run baseline**: Execute the first experiment with no changes to establish the baseline metric.\n6. **Start looping**: Begin the experiment loop immediately after the baseline is logged.\n\n### `autoresearch.md`\n\nThe heart of the session. A fresh agent with no context should be able to read this file alone and run the loop effectively. 
Invest time making it excellent.\n\n```markdown\n# Autoresearch: <goal>\n\n## Objective\n<Specific description of what we're optimizing and the workload.>\n\n## Metrics\n- **Primary**: <name> (<unit>, lower/higher is better)\n- **Secondary**: <name>, <name>, ...\n\n## How to Run\n`./autoresearch.sh` — outputs `METRIC name=value` lines.\n\n## Files in Scope\n<Every file the agent may modify, with a brief note on what it does.>\n\n## Off Limits\n<What must NOT be touched — evaluation harness, data prep, etc.>\n\n## Constraints\n<Hard rules: tests must pass, no new deps, fixed time budget, etc.>\n\n## What's Been Tried\n<Update this section as experiments accumulate. Note key wins, dead ends,\nand architectural insights so the agent doesn't repeat failed approaches.>\n```\n\nUpdate `autoresearch.md` periodically — especially \"What's Been Tried\" — so resuming agents have full context.\n\n### `autoresearch.sh`\n\nBash script that runs the benchmark and outputs structured metrics.\n\n```bash\n#!/bin/bash\nset -euo pipefail\n\n# Pre-checks (fast, <1s — catch syntax errors early)\npython3 -c \"import ast; ast.parse(open('train.py').read())\"\n\n# Run benchmark\nuv run train.py > /tmp/autoresearch-output.log 2>&1\n\n# Extract and output metrics as METRIC lines\nval_bpb=$(grep \"^val_bpb:\" /tmp/autoresearch-output.log | awk '{print $2}')\necho \"METRIC val_bpb=$val_bpb\"\n```\n\nRules:\n\n- Use `set -euo pipefail`.\n- Output `METRIC name=value` lines to stdout (one per metric). The primary metric name must match what's documented in `autoresearch.md`.\n- Metric names: word chars, dots, or `µ` (e.g. `val_bpb`, `total_µs`, `bundle.size_kb`).\n- Keep the script fast — every second is multiplied by hundreds of runs.\n- For fast/noisy benchmarks (\\<5s), run multiple times inside the script and report the median.\n- Update the script during the loop as needed.\n\n### `autoresearch.checks.sh` (optional)\n\nBackpressure checks: tests, types, lint. 
**Only create when constraints require correctness validation.**\n\n```bash\n#!/bin/bash\nset -euo pipefail\npnpm test --run --reporter=dot 2>&1 | tail -50\n# Print only error lines, but still fail the script when the type check fails\npnpm typecheck > /tmp/autoresearch-typecheck.log 2>&1 || { grep -i error /tmp/autoresearch-typecheck.log || true; exit 1; }\n```\n\nWhen this file exists:\n\n- Run it after every **passing** benchmark (exit 0).\n- If checks fail, log the experiment as `checks_failed` and revert.\n- Check execution time does NOT affect the primary metric.\n- Keep output minimal — suppress verbose progress, only show errors.\n\nWhen this file does not exist, skip checks entirely.\n\n## The Experiment Loop\n\n**LOOP FOREVER.** Never ask \"should I continue?\" — the user expects autonomous work.\n\nEach iteration:\n\n01. **Formulate hypothesis**: Based on prior results, source code understanding, and any ideas in `autoresearch.ideas.md`, choose what to try next.\n02. **Edit code**: Modify the in-scope files. Make a single, focused change per experiment.\n03. **Commit**: `git add -A && git commit -m \"<short description of what this experiment tries>\"`\n04. **Run benchmark**:\n    ```bash\n    timeout 600 ./autoresearch.sh > run.log 2>&1\n    ```\n    If the command times out or crashes, treat it as a failure.\n05. **Parse metrics**: Extract `METRIC` lines from the output:\n    ```bash\n    grep '^METRIC ' run.log\n    ```\n    If no METRIC lines found, the run crashed — read `tail -50 run.log` for the error.\n06. **Run checks** (if `autoresearch.checks.sh` exists and benchmark passed):\n    ```bash\n    timeout 300 ./autoresearch.checks.sh > checks.log 2>&1\n    ```\n07. **Evaluate and log**:\n    - **Improved** (primary metric better than best so far) → status `keep`. The commit stays.\n    - **Worse or equal** → status `discard`. Revert the code changes while keeping the autoresearch session files (step 09).\n    - **Crash** (benchmark failed) → status `crash`. Fix if trivial, otherwise revert and move on.\n    - **Checks failed** → status `checks_failed`. Revert.\n08. 
**Log to JSONL**: Append one line to `autoresearch.jsonl`:\n    ```json\n    {\"run\":1,\"commit\":\"a1b2c3d\",\"metric\":0.9979,\"metrics\":{\"val_bpb\":0.9979,\"peak_vram_mb\":45060.2},\"status\":\"keep\",\"description\":\"baseline\",\"timestamp\":1711036800000,\"confidence\":null}\n    ```\n09. **On discard/crash/checks_failed — revert code changes**:\n    ```bash\n    # Undo the experiment commit from step 03 (mixed reset: the working tree is left as-is)\n    git reset --quiet HEAD~1\n    # Preserve autoresearch session files (stage each one that exists), revert everything else\n    for f in autoresearch.jsonl autoresearch.md autoresearch.sh autoresearch.ideas.md autoresearch.checks.sh; do\n      if [ -e \"$f\" ]; then git add \"$f\"; fi\n    done\n    git checkout -- .\n    git clean -fd\n    ```\n10. **Check confidence**: After 3+ runs, run the confidence script from the skill's installation directory:\n    ```bash\n    bash \"$(dirname \"$(readlink -f \"$0\")\")/scripts/confidence.sh\"\n    ```\n    Interpret the score:\n    - **>= 2.0x**: Improvement is likely real (green).\n    - **1.0-2.0x**: Above noise but marginal (yellow).\n    - **< 1.0x**: Within noise — consider re-running to confirm (red).\n11. 
**Update session**: Periodically update `autoresearch.md` \"What's Been Tried\" section and run the summary script to review progress.\n\nRepeat forever until interrupted.\n\n## JSONL Schema\n\nEach line in `autoresearch.jsonl` is a JSON object:\n\n| Field         | Type           | Description                                    |\n| ------------- | -------------- | ---------------------------------------------- |\n| `run`         | number         | 1-indexed experiment count                     |\n| `commit`      | string         | Short git SHA (7 chars)                        |\n| `metric`      | number         | Primary metric value                           |\n| `metrics`     | object         | All metrics dict (primary + secondary)         |\n| `status`      | string         | `keep`, `discard`, `crash`, or `checks_failed` |\n| `description` | string         | What this experiment tried                     |\n| `timestamp`   | number         | Unix timestamp (ms)                            |\n| `confidence`  | number or null | MAD-based confidence score (null if \\<3 runs)  |\n\n## Resuming\n\nWhen `autoresearch.md` exists in the working directory:\n\n1. Read `autoresearch.md` for full context (objective, what's been tried, constraints).\n2. Read `autoresearch.jsonl` to reconstruct state (best metric, run count, last segment).\n3. Read `git log --oneline -20` for recent commit history.\n4. Check `autoresearch.ideas.md` if it exists — prune stale entries, experiment with promising ones.\n5. Continue the loop from where it left off. Do not re-run the baseline.\n\n## Ideas Backlog\n\nWhen you discover complex but promising optimizations you won't pursue right now, append them as bullets to `autoresearch.ideas.md`. Don't let good ideas get lost.\n\nOn resume, check this file — prune stale/tried entries, experiment with the rest. 
When all paths are exhausted, delete the file and write a final summary to `autoresearch.md`.\n\n## Loop Rules\n\nSee `references/loop-rules.md` for the full reference. Key rules:\n\n- **Primary metric is king.** Improved → keep. Worse/equal → discard.\n- **Simpler is better.** Remove code for equal perf = keep. Ugly complexity for tiny gain = discard.\n- **Don't thrash.** Repeatedly reverting the same idea? Try something structurally different.\n- **Think longer when stuck.** Re-read source files, reason about what the CPU/compiler/runtime is actually doing. Deep understanding beats random variation.\n- **Crashes**: fix if trivial (typo, missing import), otherwise log and move on. Don't over-invest.\n- **NEVER STOP.** The user may be away for hours. Keep going until interrupted.\n\n## User Messages During Experiments\n\nIf the user sends a message while an experiment is running, finish the current run-evaluate-log cycle first, then incorporate their feedback in the next iteration.","tags":["autoresearch","agent","skills","paulrberg","agent-skills","ai-agents"],"capabilities":["skill","source-paulrberg","skill-autoresearch","topic-agent-skills","topic-ai-agents"],"categories":["agent-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/PaulRBerg/agent-skills/autoresearch","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add PaulRBerg/agent-skills","source_repo":"https://github.com/PaulRBerg/agent-skills","install_from":"skills.sh"}},"qualityScore":"0.478","qualityRationale":"deterministic score 0.48 from registry signals: · indexed on github topic:agent-skills · 56 github stars · SKILL.md body (9,036 
chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T18:57:35.052Z","embedding":null,"createdAt":"2026-04-18T22:17:31.992Z","updatedAt":"2026-05-18T18:57:35.052Z","lastSeenAt":"2026-05-18T18:57:35.052Z",
"prices":[{"id":"85be06f5-89b6-4a7e-ad09-37c3cca75894","listingId":"77438244-03c6-4360-aa4a-4536f4a691e3","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"PaulRBerg","category":"agent-skills","install_from":"skills.sh"},"createdAt":"2026-04-18T22:17:31.992Z"}],"sources":[{"listingId":"77438244-03c6-4360-aa4a-4536f4a691e3","source":"github","sourceId":"PaulRBerg/agent-skills/autoresearch","sourceUrl":"https://github.com/PaulRBerg/agent-skills/tree/main/skills/autoresearch","isPrimary":false,"firstSeenAt":"2026-04-18T22:17:31.992Z","lastSeenAt":"2026-05-18T18:57:35.052Z"}],"details":{"listingId":"77438244-03c6-4360-aa4a-4536f4a691e3","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"PaulRBerg","slug":"autoresearch","github":{"repo":"PaulRBerg/agent-skills","stars":56,"topics":["agent-skills","ai-agents"],"license":"mit","html_url":"https://github.com/PaulRBerg/agent-skills","pushed_at":"2026-05-17T10:33:19Z","description":"PRB's collection of agent skills","skill_md_sha":"3a85a571e19ee0c829f0ca5dda5c25292e46741a","skill_md_path":"skills/autoresearch/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/PaulRBerg/agent-skills/tree/main/skills/autoresearch"},"layout":"multi","source":"github","category":"agent-skills","frontmatter":{"name":"autoresearch","description":"This skill should be used when the user asks to \"run autoresearch\", \"optimize X in a loop\", \"set up autonomous experiments\", \"start autoresearch\", \"optimize X overnight\", or \"experiment loop\". Sets up and runs an autonomous experiment loop for any optimization target."},"skills_sh_url":"https://skills.sh/PaulRBerg/agent-skills/autoresearch"},"updatedAt":"2026-05-18T18:57:35.052Z"}}