{"id":"38b4a656-1e08-440a-a8ed-2b495a974aaa","shortId":"R9nRUq","kind":"skill","title":"deidentify","tagline":"De-identify clinical research data before LLM-assisted analysis. Standalone Python CLI detects PHI via regex + heuristics with 10 country locale packs (kr, us, jp, cn, de, uk, fr, ca, au, in). Interactive terminal review. No LLM touches raw data — the script runs locally without ","description":"# De-identification Skill\n\nYou are guiding a medical researcher through data de-identification. The actual\nde-identification is performed by a **standalone Python script** that runs WITHOUT\nany LLM. Your role is to explain, guide, and verify — not to see or process raw\nPHI data.\n\n## Critical Safety Rules\n\n1. **NEVER ask the user to paste, show, or upload raw data containing PHI.**\n   The script processes data locally. You never need to see patient-level data.\n2. **NEVER read or display the mapping file contents.** It contains original PHI values.\n3. **You may read** the scan report (column classifications, no raw values), audit log\n   (SHA-256 hashes only), and de-identified output (PHI already removed).\n4. **Always communicate in the user's preferred language** about the process, but use\n   English for technical terms (PHI, HIPAA, Safe Harbor, etc.).\n\n## Reference Files\n\n- `${CLAUDE_SKILL_DIR}/references/hipaa_18_identifiers.md` — HIPAA Safe Harbor checklist\n- `${CLAUDE_SKILL_DIR}/references/korean_phi_patterns.md` — Korean-specific regex patterns\n- `${CLAUDE_SKILL_DIR}/references/date_shift_guide.md` — Date shifting best practices\n\nRead relevant references before advising the researcher.\n\n## Prerequisites\n\n- Python 3.10+\n- `openpyxl` (for .xlsx files): `pip install openpyxl`\n- Supported formats: CSV, TSV, Excel (.xlsx)\n\n## Five-Phase Workflow\n\n### Phase 1: Assessment\n\nAsk the researcher:\n1. What file format is the data? (CSV, Excel, etc.)\n2. What PHI do you expect in the data? (names, dates, IDs, etc.)\n3. Does your IRB require specific de-identification documentation?\n4. Do you need to re-identify later? (affects mapping file choice)\n\nBased on answers, recommend the appropriate command:\n- Full pipeline (most common): `python deidentify.py full <file> --locale <code>`\n- Step-by-step (cautious): `python deidentify.py scan <file> --locale <code>` first\n\nAvailable locale codes: `kr` (Korea), `us` (USA), `jp` (Japan), `cn` (China), `de` (Germany),\n`uk` (United Kingdom), `fr` (France), `ca` (Canada), `au` (Australia), `in` (India).\nIf `--locale` is omitted, the script shows an interactive country selection menu.\nUsers can provide a custom locale file via `--locale-file custom.json`.\n\n### Phase 2: Script Execution\n\nGuide the researcher to run the script. The script is located at:\n```\n${CLAUDE_SKILL_DIR}/deidentify.py\n```\n\n**Full pipeline** (recommended for most users):\n```bash\npython ${CLAUDE_SKILL_DIR}/deidentify.py full data.xlsx \\\n    --locale kr \\\n    --output-dir ./deidentified/ \\\n    --auto-accept-safe\n```\n\n**Step-by-step** (for careful review):\n```bash\n# Step 1: Scan\npython ${CLAUDE_SKILL_DIR}/deidentify.py scan data.xlsx --locale kr --output-dir ./deidentified/\n\n# Step 2: Review (interactive)\npython ${CLAUDE_SKILL_DIR}/deidentify.py review ./deidentified/scan_report.json\n\n# Step 3: Apply\npython ${CLAUDE_SKILL_DIR}/deidentify.py apply ./deidentified/reviewed_report.json\n```\n\n**Options:**\n- `--locale CODE`: Country locale for PHI patterns (kr, us, jp, cn, de, uk, fr, ca, au, in)\n- `--locale-file PATH`: Custom locale JSON file (copy `locales/_template.json` to create one)\n- `--auto-accept-safe`: Skip confirmation for columns classified as SAFE (faster for large datasets)\n- `--hash-mapping`: Store SHA-256 hashes instead of original values in mapping file (one-way, more secure)\n- `--output-dir`: Where to save de-identified file, mapping, and audit log\n- `-v/--verbose`: Enable debug logging\n\n### Phase 3: Interactive Review Guidance\n\nThe script's terminal review has three passes:\n\n1. **Pass 1 — Column Classification**: Each column is shown as PHI / REVIEW_NEEDED / SAFE.\n   The researcher confirms or overrides each classification.\n2. **Pass 2 — Undecided Items**: Columns that weren't resolved in Pass 1 get a second look\n   with more sample values displayed.\n3. **Pass 3 — Final Summary**: A table of all planned actions. The researcher can edit\n   individual decisions before confirming.\n\nCoach the researcher. Deliver these prompts in the researcher's preferred language:\n- \"Columns classified as PHI are anonymized by default. Press 'k' to keep the original value.\"\n- \"REVIEW_NEEDED are columns the script could not classify. Check the sample values and decide.\"\n- \"SAFE means no PHI detected. Press 'r' to request re-review if any column looks suspicious.\"\n\n### Phase 4: Verify and Document\n\nAfter the script completes, help the researcher verify:\n\n1. **Read the audit log** (safe — contains only hashes):\n   ```bash\n   cat ./deidentified/audit_log.csv | head -20\n   ```\n   Verify the number of changes, affected columns, and PHI types.\n\n2. **Spot-check the de-identified file** (safe — PHI already removed):\n   Read a few rows to confirm pseudonyms (P0001, etc.), date shifts, and [REDACTED] markers\n   appear where expected.\n\n3. **Check that sensitive columns are actually removed**:\n   Verify no original names, phone numbers, or RRN values remain.\n\n4. **Mapping file security**:\n   - Remind the researcher: \"mapping.json contains original patient identifiers — treat it as restricted.\"\n   - Recommend storing it separately from the de-identified data\n   - File permissions are automatically set to 0600 (owner-only)\n\n### Phase 5: Documentation\n\nGenerate a de-identification methods paragraph for the manuscript or IRB:\n\nTemplate:\n> Protected health information was removed from the dataset prior to analysis using\n> a rule-based de-identification tool (deidentify.py, medsci-skills) with the [COUNTRY]\n> locale pattern pack. The tool scanned column names and cell values using regex patterns\n> for country-specific identifiers (e.g., national ID numbers, phone numbers), email\n> addresses, dates, and addresses. Each column classification was reviewed by the\n> researcher in an interactive terminal session. Names were replaced with pseudonyms\n> (P0001, P0002, ...), dates were shifted by a random per-patient offset (±365 days)\n> preserving relative temporal intervals, and direct identifiers (phone numbers, email\n> addresses, national ID numbers) were suppressed. A total of [N] cells across [M]\n> columns were de-identified. The de-identification mapping file was stored separately\n> under restricted access (file permissions 0600).\n\nCustomize based on the actual audit log statistics.\n\n## Cross-Skill Integration\n\n- **deidentify** sits BEFORE `clean-data` in the research pipeline\n- After de-identification, hand off to `/clean-data` for data quality profiling\n- `/analyze-stats` can safely process the de-identified output\n- `/write-paper` Methods section should reference the de-identification process\n- `/write-protocol` can use the HIPAA/PIPA reference files for protocol documentation\n\n## Output Files\n\n| File | Contains PHI? | Safe for Claude? | Purpose |\n|------|:------------:|:----------------:|---------|\n| `*_deidentified.xlsx/csv` | No | Yes | De-identified data for analysis |\n| `mapping.json` | **YES** | **No** | Original ↔ pseudonym mapping |\n| `audit_log.csv` | No (hashes only) | Yes | What was changed and where |\n| `scan_report.json` | No | Yes | Column classification results |\n| `reviewed_report.json` | No | Yes | Researcher-reviewed classifications |\n\n## Scope and Limitations\n\n**Supported (v1)**:\n- Structured tabular data: CSV, TSV, Excel (.xlsx)\n- 10 country locales with country-specific PHI patterns:\n  - Korea (kr): RRN (주민번호), phone, email, address, Hangul names, dates\n  - USA (us): SSN, US phone, US address, zip codes\n  - Japan (jp): マイナンバー, Japanese phone, 都道府県 address, Kanji names\n  - China (cn): 身份证号, Chinese phone, 省市区 address, Chinese names\n  - Germany (de): Steuer-ID, German phone, Straße address\n  - UK (uk): NHS Number, NI Number, UK phone, postcodes\n  - France (fr): NIR/INSEE, French phone, Rue address\n  - Canada (ca): SIN, Canadian phone, postal codes\n  - Australia (au): TFN, Medicare number, AU phone\n  - India (in): Aadhaar, PAN, Indian phone, pin codes\n- Universal patterns (all locales): email, ISO dates, high-cardinality numeric IDs (MRN)\n- English column names recognized across all locales\n- Custom locale support via `--locale-file` with template\n- Pseudonymization, date shifting, ID replacement, suppression\n\n**NOT supported (planned for v2)**:\n- DICOM image metadata (PS3.15 Annex E) — requires pydicom\n- Clinical free-text NER (clinical notes, radiology reports)\n- Automated k-anonymity / l-diversity assessment\n- SPSS (.sav), SAS (.sas7bdat), or other statistical formats\n\n## Anti-Hallucination\n\n- **Never fabricate file paths, URLs, DOIs, or package names.** Verify existence before recommending.\n- **Never invent journal metadata, impact factors, or submission policies** without verification at the journal's website.\n- If a tool, package, or resource does not exist or you are unsure, say so explicitly rather than guessing.","tags":["deidentify","medsci","skills","aperivue","agent-skills","biostatistics","claude-code","claude-skills","clinical-research","diagnostic-accuracy","irb-protocol","literature-review"],"capabilities":["skill","source-aperivue","skill-deidentify","topic-agent-skills","topic-biostatistics","topic-claude-code","topic-claude-skills","topic-clinical-research","topic-diagnostic-accuracy","topic-irb-protocol","topic-literature-review","topic-manuscript","topic-medical-ai","topic-medical-research","topic-meta-analysis"],"categories":["medsci-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/Aperivue/medsci-skills/deidentify","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add Aperivue/medsci-skills","source_repo":"https://github.com/Aperivue/medsci-skills","install_from":"skills.sh"}},"qualityScore":"0.499","qualityRationale":"deterministic score 0.50 from registry signals: · indexed on github topic:agent-skills · 98 github stars · SKILL.md body (8,769 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T18:56:29.266Z","embedding":null,"createdAt":"2026-05-13T12:57:44.375Z","updatedAt":"2026-05-18T18:56:29.266Z","lastSeenAt":"2026-05-18T18:56:29.266Z","tsv":"'-20':703 '-256':157,510 '/analyze-stats':980 '/clean-data':975 '/csv':1020 '/deidentified':409,437 '/deidentified/audit_log.csv':701 '/deidentified/reviewed_report.json':458 '/deidentified/scan_report.json':448 '/deidentify.py':389,401,429,446,456 '/references/date_shift_guide.md':213 '/references/hipaa_18_identifiers.md':196 '/references/korean_phi_patterns.md':204 '/write-paper':989 '/write-protocol':999 '0600':794,945 '1':100,246,251,423,556,558,589,690 '10':22,1070 '2':128,261,371,439,577,579,714 '3':142,274,450,544,599,601,744 '3.10':227 '365':901 '4':168,284,678,762 '5':799 'aadhaar':1157 'accept':412,492 'access':942 'across':924,1180 'action':609 'actual':65,750,950 'address':867,870,913,1085,1095,1104,1113,1124,1140 'advis':222 'affect':293,709 'alreadi':166,725 'alway':169 'analysi':12,824,1028 'annex':1207 'anonym':635,1223 'answer':299 'anti':1237 'anti-hallucin':1236 'appear':741 'appli':451,457 'appropri':302 'ask':102,248 'assess':247,1227 'assist':11 'au':34,342,475,1149,1153 'audit':154,536,693,951 'audit_log.csv':1035 'australia':343,1148 'auto':411,491 'auto-accept-saf':410,490 'autom':1220 'automat':791 'avail':322 'base':297,829,947 'bash':396,421,699 'best':216 'ca':33,340,474,1142 'canada':341,1141 'canadian':1144 'cardin':1172 'care':419 'cat':700 'cautious':316 'cell':850,923 'chang':708,1042 'check':654,717,745 'checklist':200 'china':332,1107 'chines':1110,1114 'choic':296 'classif':150,560,576,873,1049,1057 'classifi':498,631,653 'claud':193,201,210,386,398,426,443,453,1016 'clean':962 'clean-data':961 'cli':15 'clinic':5,1211,1216 'cn':29,331,470,1108 'coach':618 'code':324,461,1097,1147,1162 'column':149,497,559,562,582,630,648,674,710,748,847,872,926,1048,1177 'command':303 'common':307 'communic':170 'complet':685 'confirm':495,572,617,732 'contain':112,138,696,770,1012 'content':136 'copi':485 'could':651 'countri':23,355,462,840,857,1071,1075 'country-specif':856,1074 'creat':488 'critic':97 'cross':955 'cross-skil':954 'csv':237,258,1066 'custom':362,481,946,1183 'custom.json':369 'data':7,43,60,96,111,117,127,257,269,787,963,977,1026,1065 'data.xlsx':403,431 'dataset':504,821 'date':214,271,736,868,891,1088,1169,1193 'day':902 'de':3,30,50,62,67,162,281,333,471,531,720,785,804,831,929,933,970,986,996,1024,1117 'de-identif':49,61,66,280,803,830,932,969,995 'de-identifi':2,161,530,719,784,928,985,1023 'debug':541 'decid':659 'decis':615 'default':637 'deidentifi':1,958 'deidentified.xlsx':1019 'deidentified.xlsx/csv':1018 'deidentify.py':309,318,834 'deliv':621 'detect':16,664 'dicom':1203 'dir':195,203,212,388,400,408,428,436,445,455,526 'direct':908 'display':132,598 'divers':1226 'document':283,681,800,1008 'doi':1244 'e':1208 'e.g':860 'edit':613 'email':866,912,1084,1167 'enabl':540 'english':182,1176 'etc':190,260,273,735 'excel':239,259,1068 'execut':373 'exist':1249,1276 'expect':266,743 'explain':85 'explicit':1283 'fabric':1240 'factor':1257 'faster':501 'file':135,192,231,253,295,364,368,479,484,518,533,722,764,788,936,943,1005,1010,1011,1189,1241 'final':602 'first':321 'five':242 'five-phas':241 'format':236,254,1235 'fr':32,338,473,1135 'franc':339,1134 'free':1213 'free-text':1212 'french':1137 'full':304,310,390,402 'generat':801 'german':1121 'germani':334,1116 'get':590 'guess':1286 'guid':55,86,374 'guidanc':547 'hallucin':1238 'hand':972 'hangul':1086 'harbor':189,199 'hash':158,506,511,698,1037 'hash-map':505 'head':702 'health':815 'help':686 'heurist':20 'high':1171 'high-cardin':1170 'hipaa':187,197 'hipaa/pipa':1003 'id':272,862,915,1120,1174,1195 'identif':51,63,68,282,805,832,934,971,997 'identifi':4,163,291,532,721,773,786,859,909,930,987,1025 'imag':1204 'impact':1256 'india':345,1155 'indian':1159 'individu':614 'inform':816 'instal':233 'instead':512 'integr':957 'interact':36,354,441,545,881 'interv':906 'invent':1253 'irb':277,812 'iso':1168 'item':581 'japan':330,1098 'japanes':1101 'journal':1254,1265 'jp':28,329,469,1099 'json':483 'k':639,1222 'k-anonym':1221 'kanji':1105 'keep':641 'kingdom':337 'korea':326,1079 'korean':206 'korean-specif':205 'kr':26,325,405,433,467,1080 'l':1225 'l-divers':1224 'languag':176,629 'larg':503 'later':292 'level':126 'limit':1060 'llm':10,40,80 'llm-assist':9 'local':24,47,118,311,320,323,347,363,367,404,432,460,463,478,482,841,1072,1166,1182,1184,1188 'locale-fil':366,477,1187 'locales/_template.json':486 'locat':384 'log':155,537,542,694,952 'look':593,675 'm':925 'manuscript':810 'map':134,294,507,517,534,763,935,1034 'mapping.json':769,1029 'marker':740 'may':144 'mean':661 'medic':57 'medicar':1151 'medsci':836 'medsci-skil':835 'menu':357 'metadata':1205,1255 'method':806,990 'mrn':1175 'n':922 'name':270,755,848,884,1087,1106,1115,1178,1247 'nation':861,914 'need':121,287,568,646 'ner':1215 'never':101,120,129,1239,1252 'nhs':1127 'ni':1129 'nir/insee':1136 'note':1217 'number':706,757,863,865,911,916,1128,1130,1152 'numer':1173 'offset':900 'omit':349 'one':489,520 'one-way':519 'openpyxl':228,234 'option':459 'origin':139,514,643,754,771,1032 'output':164,407,435,525,988,1009 'output-dir':406,434,524 'overrid':574 'owner':796 'owner-on':795 'p0001':734,889 'p0002':890 'pack':25,843 'packag':1246,1271 'pan':1158 'paragraph':807 'pass':555,557,578,588,600 'past':106 'path':480,1242 'patient':125,772,899 'patient-level':124 'pattern':209,466,842,854,1078,1164 'per':898 'per-pati':897 'perform':70 'permiss':789,944 'phase':243,245,370,543,677,798 'phi':17,95,113,140,165,186,263,465,566,633,663,712,724,1013,1077 'phone':756,864,910,1083,1093,1102,1111,1122,1132,1138,1145,1154,1160 'pin':1161 'pip':232 'pipelin':305,391,967 'plan':608,1200 'polici':1260 'postal':1146 'postcod':1133 'practic':217 'prefer':175,628 'prerequisit':225 'preserv':903 'press':638,665 'prior':822 'process':93,116,179,983,998 'profil':979 'prompt':623 'protect':814 'protocol':1007 'provid':360 'ps3.15':1206 'pseudonym':733,888,1033,1192 'purpos':1017 'pydicom':1210 'python':14,74,226,308,317,397,425,442,452 'qualiti':978 'r':666 'radiolog':1218 'random':896 'rather':1284 'raw':42,94,110,152 're':290,670 're-identifi':289 're-review':669 'read':130,145,218,691,727 'recogn':1179 'recommend':300,392,778,1251 'redact':739 'refer':191,220,993,1004 'regex':19,208,853 'relat':904 'relev':219 'remain':761 'remind':766 'remov':167,726,751,818 'replac':886,1196 'report':148,1219 'request':668 'requir':278,1209 'research':6,58,224,250,376,571,611,620,626,688,768,878,966,1055 'researcher-review':1054 'resolv':586 'resourc':1273 'restrict':777,941 'result':1050 'review':38,420,440,447,546,552,567,645,671,875,1056 'reviewed_report.json':1051 'role':82 'row':730 'rrn':759,1081 'rue':1139 'rule':99,828 'rule-bas':827 'run':46,77,378 'safe':188,198,413,493,500,569,660,695,723,982,1014 'safeti':98 'sampl':596,656 'sas':1230 'sas7bdat':1231 'sav':1229 'save':529 'say':1281 'scan':147,319,424,430,846 'scan_report.json':1045 'scope':1058 'script':45,75,115,351,372,380,382,549,650,684 'second':592 'section':991 'secur':523,765 'see':91,123 'select':356 'sensit':747 'separ':781,939 'session':883 'set':792 'sha':156,509 'shift':215,737,893,1194 'show':107,352 'shown':564 'sin':1143 'sit':959 'skill':52,194,202,211,387,399,427,444,454,837,956 'skill-deidentify' 'skip':494 'source-aperivue' 'specif':207,279,858,1076 'spot':716 'spot-check':715 'spss':1228 'ssn':1091 'standalon':13,73 'statist':953,1234 'step':313,315,415,417,422,438,449 'step-by-step':312,414 'steuer':1119 'steuer-id':1118 'store':508,779,938 'straße':1123 'structur':1063 'submiss':1259 'summari':603 'support':235,1061,1185,1199 'suppress':918,1197 'suspici':676 'tabl':605 'tabular':1064 'technic':184 'templat':813,1191 'tempor':905 'term':185 'termin':37,551,882 'text':1214 'tfn':1150 'three':554 'tool':833,845,1270 'topic-agent-skills' 'topic-biostatistics' 'topic-claude-code' 'topic-claude-skills' 'topic-clinical-research' 'topic-diagnostic-accuracy' 'topic-irb-protocol' 'topic-literature-review' 'topic-manuscript' 'topic-medical-ai' 'topic-medical-research' 'topic-meta-analysis' 'total':920 'touch':41 'treat':774 'tsv':238,1067 'type':713 'uk':31,335,472,1125,1126,1131 'undecid':580 'unit':336 'univers':1163 'unsur':1280 'upload':109 'url':1243 'us':27,327,468,1090,1092,1094 'usa':328,1089 'use':181,825,852,1001 'user':104,173,358,395 'v':538 'v1':1062 'v2':1202 'valu':141,153,515,597,644,657,760,851 'verbos':539 'verif':1262 'verifi':88,679,689,704,752,1248 'via':18,365,1186 'way':521 'websit':1267 'weren':584 'without':48,78,1261 'workflow':244 'xlsx':230,240,1069 'yes':1022,1030,1039,1047,1053 'zip':1096 'マイナンバー':1100 '省市区':1112 '身份证号':1109 '都道府県':1103 '주민번호':1082","prices":[{"id":"4a7b89f6-6569-463e-9044-025df7efc60b","listingId":"38b4a656-1e08-440a-a8ed-2b495a974aaa","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"Aperivue","category":"medsci-skills","install_from":"skills.sh"},"createdAt":"2026-05-13T12:57:44.375Z"}],"sources":[{"listingId":"38b4a656-1e08-440a-a8ed-2b495a974aaa","source":"github","sourceId":"Aperivue/medsci-skills/deidentify","sourceUrl":"https://github.com/Aperivue/medsci-skills/tree/main/skills/deidentify","isPrimary":false,"firstSeenAt":"2026-05-13T12:57:44.375Z","lastSeenAt":"2026-05-18T18:56:29.266Z"}],"details":{"listingId":"38b4a656-1e08-440a-a8ed-2b495a974aaa","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"Aperivue","slug":"deidentify","github":{"repo":"Aperivue/medsci-skills","stars":98,"topics":["agent-skills","biostatistics","claude-code","claude-skills","clinical-research","diagnostic-accuracy","irb-protocol","literature-review","manuscript","medical-ai","medical-research","meta-analysis","physician-researcher","prisma","pubmed","radiology","reporting-guidelines","strobe","systematic-review","tripod-ai"],"license":"other","html_url":"https://github.com/Aperivue/medsci-skills","pushed_at":"2026-05-17T20:50:52Z","description":"Claude Code skills for medical research — literature search, reporting guidelines, statistical analysis, publication figures. Built by a physician-researcher, tested on real publications. MIT licensed.","skill_md_sha":"022b721a3d6fa6aec7889e9a13745f0d851c5beb","skill_md_path":"skills/deidentify/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/Aperivue/medsci-skills/tree/main/skills/deidentify"},"layout":"multi","source":"github","category":"medsci-skills","frontmatter":{"name":"deidentify","description":"De-identify clinical research data before LLM-assisted analysis. Standalone Python CLI detects PHI via regex + heuristics with 10 country locale packs (kr, us, jp, cn, de, uk, fr, ca, au, in). Interactive terminal review. No LLM touches raw data — the script runs locally without any network or AI calls."},"skills_sh_url":"https://skills.sh/Aperivue/medsci-skills/deidentify"},"updatedAt":"2026-05-18T18:56:29.266Z"}}