{"id":"55a63cbe-e389-4608-8b4c-38485e200314","shortId":"neuTUb","kind":"skill","title":"clean-data","tagline":"Interactive data profiling and cleaning assistant for medical research. Three-stage workflow (profile, flag, code-generate) with user approval gates at each step. Handles missing values, outliers, duplicates, and type mismatches in CSV/Excel clinical data. Does NOT auto-clean — a","description":"# Data Profiling and Cleaning Skill\n\nYou are assisting a medical researcher with data profiling and cleaning for clinical datasets.\nThis is a three-stage interactive workflow. You generate code and reports -- you do NOT\nauto-clean data. Every cleaning decision requires explicit researcher confirmation.\n\n## Philosophy\n\nThis skill is a PROFILING AND FLAGGING ASSISTANT, not an automated data cleaner.\nClinical data cleaning requires domain expertise that an LLM cannot replace.\nEvery cleaning decision must be confirmed by the researcher.\n\n**DATA PRIVACY WARNING**\n\nIf your dataset contains Protected Health Information (PHI) or Personally Identifiable\nInformation (PII), run `/deidentify` first to remove PHI before proceeding. The deidentify\nskill provides a standalone Python script (no LLM) that scans for Korean SSN, phone numbers,\nnames, dates, and addresses, then anonymizes them with your confirmation.\n\nIf `*_deidentified.*` files exist in the working directory, use those instead of raw data.\n\nAlternatively:\n1. Provide only the data dictionary / codebook for profiling guidance\n2. Or use a local-only environment with no network access\n\nThis tool generates CODE that runs on your data -- it does not need to see the raw data\nto generate useful profiling scripts.\n\n## Reference Files\n\n- **Profiling template**: `${CLAUDE_SKILL_DIR}/references/profiling_template.py` -- reusable profiling script\n- **Cleaning patterns**: `${CLAUDE_SKILL_DIR}/references/cleaning_patterns.md` -- common clinical data patterns\n\nRead relevant references before generating profiling or cleaning code.\n\n## Three-Stage Workflow\n\n### Stage 1: Profiling\n\n**Input**: CSV/Excel file path OR data dictionary/codebook\n\n**Actions**:\n\n1. Generate a Python profiling script (pandas-based) that produces:\n   - Variable count, row count, data types\n   - Missing value count and percentage per variable\n   - Unique value counts for categorical variables\n   - Min/max/mean/median/SD for numeric variables\n   - Distribution plots (histograms for numeric, bar charts for categorical)\n2. If user provides a codebook: cross-reference variable names, expected types, expected ranges\n3. Present summary table to user\n\nUse `${CLAUDE_SKILL_DIR}/references/profiling_template.py` as the base script. Adapt it to\nthe specific dataset structure.\n\n**Gate**: User reviews profiling output before proceeding. Ask:\n> \"Here is the profiling summary. Would you like to proceed to Stage 2 (Flagging)?\n> Are there any variables you want to exclude or focus on?\"\n\n### Stage 2: Flagging\n\nBased on profiling results, flag potential issues in these categories:\n\n1. **Missing values**: Variables with >5% missing, pattern analysis (MCAR/MAR/MNAR heuristic)\n2. **Statistical outliers**: IQR method (Q1 - 1.5*IQR, Q3 + 1.5*IQR) and Z-score (|z| > 3)\n3. **Duplicates**: Exact row duplicates AND near-duplicates (same patient ID, different dates)\n4. **Type mismatches**: Numeric stored as string, dates in inconsistent formats\n5. **Implausible values**: ONLY if codebook provides valid ranges; otherwise flag as \"review needed\"\n6. **Category inconsistencies**: Typos in categorical values (e.g., \"Male\", \"male\", \"M\", \"MALE\")\n\nPresent the flag report as a structured table:\n\n| Variable | Issue Type | Count | Severity | Suggested Action |\n|----------|-----------|-------|----------|-----------------|\n| age | Outlier (IQR) | 3 | Medium | Review: values 150, 200, -5 |\n| sex | Category inconsistency | 12 | Low | Harmonize: Male/male/M -> \"Male\" |\n| lab_date | Type mismatch | 45 | High | Parse to datetime |\n\nSeverity levels:\n- **High**: Likely data errors that will affect analysis (type mismatches, impossible values)\n- **Medium**: Potential issues that need expert review (statistical outliers, moderate missingness)\n- **Low**: Minor inconsistencies that are easy to fix (category labels, trailing whitespace)\n\n**Gate**: User reviews flags and approves/rejects each suggested action. Ask:\n> \"Please review the flagged issues above. For each row, indicate:\n> (A) Approve the suggested action, (R) Reject / keep as-is, or (M) Modify the action.\n> Only approved actions will generate cleaning code.\"\n\n### Stage 3: Code Generation\n\nFor ONLY user-approved cleaning actions, generate Python (or R if requested) code:\n\n- **Missing value handling**: Listwise deletion, mean/median imputation, or MICE setup (code only, user runs)\n- **Outlier handling**: Winsorization, removal, or keep-and-flag\n- **Duplicate removal**: Exact dedup with logging\n- **Type conversion**: Standardize dates, numeric parsing\n- **Category harmonization**: Mapping table for inconsistent labels\n\nAll generated code MUST include:\n- Before/after row counts printed to console\n- Logging of every modification to a cleaning log DataFrame\n- Reproducibility: `np.random.seed(42)` and `random.seed(42)` where applicable\n- Output: cleaned CSV + `cleaning_log.csv`\n- Clear comments explaining each cleaning step\n\nEnd the generated script with this notice:\n> \"This code implements ONLY the cleaning rules you approved. Review the cleaning_log.csv\n> output to verify all changes before proceeding to analysis.\"\n\n## Scope Limitations\n\n**Supported**:\n- Missing values (detection, simple imputation code, MICE setup)\n- Outliers (statistical detection via IQR and Z-score)\n- Duplicates (exact and near-duplicate detection)\n- Type mismatches (numeric parsing, date standardization)\n- Category harmonization (case, abbreviation, whitespace)\n\n**NOT supported**:\n- Domain-specific plausible ranges (unless codebook provided)\n- Complex imputation strategy selection (MICE setup only, user picks variables/method)\n- Natural language extraction from clinical notes\n- Image data cleaning or DICOM metadata\n- Automated decisions -- all cleaning requires researcher approval\n\n> This tool flags issues. Final cleaning decisions require your domain knowledge.\n\n## Cross-Skill Integration\n\n- **clean-data** sits BEFORE `analyze-stats` in the research pipeline\n- `design-study` can inform which variables to focus profiling on\n- `manage-project` tracks overall project state including data cleaning status\n- After cleaning, hand off to `analyze-stats` for statistical analysis\n\n## Output Format\n\nStructure all reports using this template:\n\n```\n## Data Profiling Report\n\n### Dataset Overview\n- Rows: [N]\n- Columns: [N]\n- File size: [size]\n- Date range: [if applicable]\n\n### Variable Summary\n| Variable | Type | Missing N (%) | Unique | Min | Max | Mean | SD |\n|----------|------|---------------|--------|-----|-----|------|-----|\n| ...      | ...  | ...           | ...    | ... | ... | ...  | ... |\n\n### Flags\n| Variable | Issue | Count | Severity | Suggested Action |\n|----------|-------|-------|----------|-----------------|\n| ...      | ...   | ...   | ...      | ...             |\n\n### Cleaning Code\n[Python/R script -- only for approved actions]\n\n### Cleaning Log\n[What was changed, how many rows affected, before/after counts]\n```\n\n## Anti-Hallucination\n\n- **Never fabricate variable names, dataset column names, or variable codings.** If a variable mapping is uncertain, output `[VERIFY: variable_name]` and ask the user to confirm against the data dictionary.\n- **Never fabricate statistical results** — no invented p-values, effect sizes, confidence intervals, or sample sizes. All numbers must come from executed code output.\n- **Never generate references from memory.** Use `/search-lit` for all citations.\n- If a function, package, or API does not exist or you are unsure, say so explicitly rather than guessing.","tags":["clean","data","medsci","skills","aperivue","agent-skills","biostatistics","claude-code","claude-skills","clinical-research","diagnostic-accuracy","irb-protocol"],"capabilities":["skill","source-aperivue","skill-clean-data","topic-agent-skills","topic-biostatistics","topic-claude-code","topic-claude-skills","topic-clinical-research","topic-diagnostic-accuracy","topic-irb-protocol","topic-literature-review","topic-manuscript","topic-medical-ai","topic-medical-research","topic-meta-analysis"],"categories":["medsci-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/Aperivue/medsci-skills/clean-data","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add Aperivue/medsci-skills","source_repo":"https://github.com/Aperivue/medsci-skills","install_from":"skills.sh"}},"qualityScore":"0.499","qualityRationale":"deterministic score 0.50 from registry signals: · indexed on github topic:agent-skills · 98 github stars · SKILL.md body (7,510 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T18:56:29.008Z","embedding":null,"createdAt":"2026-05-13T12:57:44.134Z","updatedAt":"2026-05-18T18:56:29.008Z","lastSeenAt":"2026-05-18T18:56:29.008Z","tsv":"'-5':512 '/deidentify':144 '/references/cleaning_patterns.md':254 '/references/profiling_template.py':245,351 '/search-lit':997 '1':193,273,283,409 '1.5':426,429 '12':516 '150':510 '2':203,326,383,397,420 '200':511 '3':341,436,437,506,611 '4':451 '42':692,695 '45':525 '5':414,462 '6':476 'abbrevi':772 'access':214 'action':282,502,575,591,602,605,620,914,922 'adapt':356 'address':171 'affect':538,931 'age':503 'altern':192 'analysi':417,539,735,872 'analyz':834,868 'analyze-stat':833,867 'anonym':173 'anti':935 'anti-hallucin':934 'api':1006 'applic':697,896 'approv':24,588,604,618,723,812,921 'approves/rejects':572 'as-i':595 'ask':370,576,958 'assist':9,54,101 'auto':44,83 'auto-clean':43,82 'autom':104,806 'bar':322 'base':291,354,399 'before/after':675,932 'cannot':116 'case':771 'categor':311,325,481 'categori':408,477,514,563,663,769 'chang':731,927 'chart':323 'citat':1000 'claud':242,251,348 'clean':2,8,45,50,62,84,87,109,119,249,266,608,619,687,699,706,720,802,809,818,829,860,863,915,923 'clean-data':1,828 'cleaner':106 'cleaning_log.csv':701,726 'clear':702 'clinic':39,64,107,256,798 'code':20,76,218,267,609,612,627,638,672,716,744,916,946,989 'code-gener':19 'codebook':199,331,467,782 'column':888,942 'come':986 'comment':703 'common':255 'complex':784 'confid':978 'confirm':92,123,177,962 'consol':680 'contain':133 'convers':658 'count':295,297,302,309,499,677,911,933 'cross':333,825 'cross-refer':332 'cross-skil':824 'csv':700 'csv/excel':38,276 'data':3,5,40,47,59,85,105,108,127,191,197,223,232,257,280,298,534,801,830,859,881,965 'datafram':689 'dataset':65,132,361,884,941 'date':169,450,458,522,660,767,893 'datetim':529 'decis':88,120,807,819 'dedup':654 'deidentifi':152,179 'delet':632 'design':841 'design-studi':840 'detect':741,749,762 'dicom':804 'dictionari':198,966 'dictionary/codebook':281 'differ':449 'dir':244,253,350 'directori':185 'distribut':317 'domain':111,777,822 'domain-specif':776 'duplic':33,438,441,445,651,756,761 'e.g':483 'easi':560 'effect':976 'end':708 'environ':210 'error':535 'everi':86,118,683 'exact':439,653,757 'exclud':392 'execut':988 'exist':181,1009 'expect':337,339 'expert':549 'expertis':112 'explain':704 'explicit':90,1016 'extract':796 'fabric':938,968 'file':180,239,277,890 'final':817 'first':145 'fix':562 'flag':18,100,384,398,403,472,490,570,580,650,815,908 'focus':394,848 'format':461,874 'function':1003 'gate':25,363,567 'generat':21,75,217,234,263,284,607,613,621,671,710,992 'guess':1019 'guidanc':202 'hallucin':936 'hand':864 'handl':29,630,643 'harmon':518,664,770 'health':135 'heurist':419 'high':526,532 'histogram':319 'id':448 'identifi':140 'imag':800 'implaus':463 'implement':717 'imposs':542 'imput':634,743,785 'includ':674,858 'inconsist':460,478,515,557,668 'indic':586 'inform':136,141,844 'input':275 'instead':188 'integr':827 'interact':4,72 'interv':979 'invent':972 'iqr':423,427,430,505,751 'issu':405,497,546,581,816,910 'keep':594,648 'keep-and-flag':647 'knowledg':823 'korean':164 'lab':521 'label':564,669 'languag':795 'level':531 'like':378,533 'limit':737 'listwis':631 'llm':115,160 'local':208 'local-on':207 'log':656,681,688,924 'low':517,555 'm':486,599 'male':484,485,487,520 'male/male/m':519 'manag':852 'manage-project':851 'mani':929 'map':665,950 'max':905 'mcar/mar/mnar':418 'mean':906 'mean/median':633 'medic':11,56 'medium':507,544 'memori':995 'metadata':805 'method':424 'mice':636,745,788 'min':904 'min/max/mean/median/sd':313 'minor':556 'mismatch':36,453,524,541,764 'miss':30,300,410,415,628,739,901 'missing':554 'moder':553 'modif':684 'modifi':600 'must':121,673,985 'n':887,889,902 'name':168,336,940,943,956 'natur':794 'near':444,760 'near-dupl':443,759 'need':227,475,548 'network':213 'never':937,967,991 'note':799 'notic':714 'np.random.seed':691 'number':167,984 'numer':315,321,454,661,765 'otherwis':471 'outlier':32,422,504,552,642,747 'output':367,698,727,873,953,990 'overal':855 'overview':885 'p':974 'p-valu':973 'packag':1004 'panda':290 'pandas-bas':289 'pars':527,662,766 'path':278 'patient':447 'pattern':250,258,416 'per':305 'percentag':304 'person':139 'phi':137,148 'philosophi':93 'phone':166 'pick':792 'pii':142 'pipelin':839 'plausibl':779 'pleas':577 'plot':318 'potenti':404,545 'present':342,488 'print':678 'privaci':128 'proceed':150,369,380,733 'produc':293 'profil':6,17,48,60,98,201,236,240,247,264,274,287,366,374,401,849,882 'project':853,856 'protect':134 'provid':154,194,329,468,783 'python':157,286,622 'python/r':917 'q1':425 'q3':428 'r':592,624 'random.seed':694 'rang':340,470,780,894 'rather':1017 'raw':190,231 'read':259 'refer':238,261,334,993 'reject':593 'relev':260 'remov':147,645,652 'replac':117 'report':78,491,877,883 'reproduc':690 'request':626 'requir':89,110,810,820 'research':12,57,91,126,811,838 'result':402,970 'reusabl':246 'review':365,474,508,550,569,578,724 'row':296,440,585,676,886,930 'rule':721 'run':143,220,641 'sampl':981 'say':1014 'scan':162 'scope':736 'score':434,755 'script':158,237,248,288,355,711,918 'sd':907 'see':229 'select':787 'setup':637,746,789 'sever':500,530,912 'sex':513 'simpl':742 'sit':831 'size':891,892,977,982 'skill':51,95,153,243,252,349,826 'skill-clean-data' 'source-aperivue' 'specif':360,778 'ssn':165 'stage':15,71,270,272,382,396,610 'standalon':156 'standard':659,768 'stat':835,869 'state':857 'statist':421,551,748,871,969 'status':861 'step':28,707 'store':455 'strategi':786 'string':457 'structur':362,494,875 'studi':842 'suggest':501,574,590,913 'summari':343,375,898 'support':738,775 'tabl':344,495,666 'templat':241,880 'three':14,70,269 'three-stag':13,69,268 'tool':216,814 'topic-agent-skills' 'topic-biostatistics' 'topic-claude-code' 'topic-claude-skills' 'topic-clinical-research' 'topic-diagnostic-accuracy' 'topic-irb-protocol' 'topic-literature-review' 'topic-manuscript' 'topic-medical-ai' 'topic-medical-research' 'topic-meta-analysis' 'track':854 'trail':565 'type':35,299,338,452,498,523,540,657,763,900 'typo':479 'uncertain':952 'uniqu':307,903 'unless':781 'unsur':1013 'use':186,205,235,347,878,996 'user':23,328,346,364,568,617,640,791,960 'user-approv':616 'valid':469 'valu':31,301,308,411,464,482,509,543,629,740,975 'variabl':294,306,312,316,335,388,412,496,846,897,899,909,939,945,949,955 'variables/method':793 'verifi':729,954 'via':750 'want':390 'warn':129 'whitespac':566,773 'winsor':644 'work':184 'workflow':16,73,271 'would':376 'z':433,435,754 'z-score':432,753","prices":[{"id":"0be78460-ef4b-4216-ab48-d5967a179f8b","listingId":"55a63cbe-e389-4608-8b4c-38485e200314","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"Aperivue","category":"medsci-skills","install_from":"skills.sh"},"createdAt":"2026-05-13T12:57:44.134Z"}],"sources":[{"listingId":"55a63cbe-e389-4608-8b4c-38485e200314","source":"github","sourceId":"Aperivue/medsci-skills/clean-data","sourceUrl":"https://github.com/Aperivue/medsci-skills/tree/main/skills/clean-data","isPrimary":false,"firstSeenAt":"2026-05-13T12:57:44.134Z","lastSeenAt":"2026-05-18T18:56:29.008Z"}],"details":{"listingId":"55a63cbe-e389-4608-8b4c-38485e200314","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"Aperivue","slug":"clean-data","github":{"repo":"Aperivue/medsci-skills","stars":98,"topics":["agent-skills","biostatistics","claude-code","claude-skills","clinical-research","diagnostic-accuracy","irb-protocol","literature-review","manuscript","medical-ai","medical-research","meta-analysis","physician-researcher","prisma","pubmed","radiology","reporting-guidelines","strobe","systematic-review","tripod-ai"],"license":"other","html_url":"https://github.com/Aperivue/medsci-skills","pushed_at":"2026-05-17T20:50:52Z","description":"Claude Code skills for medical research — literature search, reporting guidelines, statistical analysis, publication figures. Built by a physician-researcher, tested on real publications. MIT licensed.","skill_md_sha":"b2181462ece759a7c4a5407eadc353bcf68756f7","skill_md_path":"skills/clean-data/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/Aperivue/medsci-skills/tree/main/skills/clean-data"},"layout":"multi","source":"github","category":"medsci-skills","frontmatter":{"name":"clean-data","description":"Interactive data profiling and cleaning assistant for medical research. Three-stage workflow (profile, flag, code-generate) with user approval gates at each step. Handles missing values, outliers, duplicates, and type mismatches in CSV/Excel clinical data. Does NOT auto-clean — all decisions require researcher confirmation."},"skills_sh_url":"https://skills.sh/Aperivue/medsci-skills/clean-data"},"updatedAt":"2026-05-18T18:56:29.008Z"}}