{"id":"67468c63-91b7-4da6-9844-de2441f142a4","shortId":"dmCkzh","kind":"skill","title":"sn-da-large-file-analysis","tagline":"万行以上 Excel 数据集的高性能分析引擎。提供 openpyxl read_only 流式读取（iter_rows 支持 10 万行以上）、Parquet 转换加速、内存优化、分块处理和大文件写入模式。**遇到以下任一情况就主动使用本 skill**：①数据行数 ≥ 10k（由 sn-da-excel-workflow 的行数评估步骤触发）；②用户出现触发词：大文件 / 大数据量 / 性能优化 / 内存不足 / OOM / 百万行 / 十万行 / 流式读取 / Parquet / 分块处理 / large file / big data / stre","description":"# Large Scale Excel Analysis Skill\n\n## Mandatory Rules\n\n> **When total rows >= 10,000, you MUST use the methods in this skill.**\n\n| Data Scale | Read Strategy | Reason |\n|-----------|---------------|--------|\n| < 10k rows | `pd.read_excel()` directly | No memory pressure |\n| 10k–100k rows | `pd.read_excel()` → convert to Parquet → `pd.read_parquet()` for analysis | Avoid repeated slow reads |\n| 100k–1M rows | **openpyxl `read_only` + `iter_rows` streaming** → Parquet | `pd.read_excel()` will OOM or timeout |\n| > 1M rows | Streaming read + **multi-sheet split** (Excel max 1,048,576 rows per sheet) | Must chunk |\n\n**Prohibited:**\n- Do NOT use `pd.read_excel()` to fully load 100k+ row files\n- Do NOT search for fonts with `fc-list`, `find ... fonts`, or install packages with `pip install`\n- Do NOT use `df.iterrows()` on large DataFrames (use `itertuples()` or vectorized ops)\n- Do NOT use `df.apply(lambda...)` for operations that can be vectorized\n\n---\n\n## Environment Setup\n\n```python\nimport pandas as pd\nimport numpy as np\nimport os\nimport gc\n\npd.options.mode.copy_on_write = True\n\n# CJK font setup (fixed paths — do NOT search for fonts)\n# ⚠️ Copy this block as-is. Do NOT use fc-list, find, subprocess, or glob to locate fonts.\nimport matplotlib\nimport matplotlib.pyplot as plt\nimport matplotlib.font_manager as fm\n\n_FONT_PATHS = [\n    '/mnt/afs_agents/SimHei.ttf',\n    '/mnt/afs_agents/mnt/data/SimHei.ttf',\n    os.path.expanduser('~/.fonts/SimHei.ttf'),\n    '/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc',\n    '/usr/share/fonts/SimHei.ttf',\n]\nfor _p in _FONT_PATHS:\n    if os.path.exists(_p):\n        fm.fontManager.addfont(_p)\n        matplotlib.rcParams['font.family'] = fm.FontProperties(fname=_p).get_name()\n        break\nmatplotlib.rcParams['axes.unicode_minus'] = False\n```\n\n---\n\n## Core Method 1: Inspect File Structure (Without Loading Data)\n\nBefore any operation on a large file, inspect sheets and row counts **without loading data into memory**:\n\n```python\nimport openpyxl\n\ndef inspect_excel(file_path):\n    \"\"\"Stream-inspect Excel structure. Returns {sheet_name: {rows, columns}}.\"\"\"\n    wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)\n    info = {}\n    for name in wb.sheetnames:\n        ws = wb[name]\n        row_count = 0\n        header = None\n        for i, row in enumerate(ws.iter_rows(values_only=True)):\n            if i == 0:\n                header = [str(c) if c is not None else f\"Col_{j}\" for j, c in enumerate(row)]\n            else:\n                row_count += 1\n        info[name] = {\"rows\": row_count, \"columns\": header}\n    wb.close()\n    return info\n\n# Usage\nfile_info = inspect_excel(file_path)\nfor sheet, meta in file_info.items():\n    print(f\"Sheet '{sheet}': {meta['rows']} rows, {len(meta['columns'])} cols\")\n    print(f\"  Columns: {meta['columns'][:10]}...\")\ntotal_rows = sum(m['rows'] for m in file_info.values())\nprint(f\"Total rows: {total_rows}\")\n```\n\n---\n\n## Core Method 2: Streaming Read → Parquet (100k+ Rows)\n\nFor 100k+ row files, **never** use `pd.read_excel()`. Use openpyxl streaming → Parquet:\n\n```python\nimport openpyxl\nimport pyarrow as pa\nimport pyarrow.parquet as pq\n\ndef stream_excel_to_parquet(excel_path, parquet_path, sheet_name=None, chunk_size=50000):\n    \"\"\"Stream Excel rows to Parquet with constant memory usage.\n\n    All columns are cast to string to avoid cross-chunk schema mismatches\n    (Excel mixed-type columns may be all-None in some chunks, causing PyArrow\n    to infer null type instead of string). Convert numeric columns after loading\n    Parquet with pd.to_numeric() as needed.\n    \"\"\"\n    wb = openpyxl.load_workbook(excel_path, read_only=True, data_only=True)\n    ws = wb[sheet_name] if sheet_name else wb.active\n\n    header = None\n    writer = None\n    chunk_rows = []\n    total_written = 0\n\n    def _flush(rows):\n        nonlocal writer\n        table = pa.table({\n            col: pa.array(\n                [str(r[idx]) if r[idx] is not None else None for r in rows],\n                type=pa.string(),\n            )\n            for idx, col in enumerate(header)\n        })\n        if writer is None:\n            writer = pq.ParquetWriter(parquet_path, table.schema)\n        writer.write_table(table)\n\n    for i, row in enumerate(ws.iter_rows(values_only=True)):\n        if i == 0:\n            header = [str(c) if c is not None else f\"Col_{j}\" for j, c in enumerate(row)]\n            continue\n\n        chunk_rows.append(list(row))\n\n        if len(chunk_rows) >= chunk_size:\n            _flush(chunk_rows)\n            total_written += len(chunk_rows)\n            print(f\"  Written {total_written:,} rows...\")\n            chunk_rows = []\n            gc.collect()\n\n    if chunk_rows:\n        _flush(chunk_rows)\n        total_written += len(chunk_rows)\n\n    if writer:\n        writer.close()\n    wb.close()\n    print(f\"Done: {total_written:,} rows -> {parquet_path}\")\n    return total_written\n```\n\n---\n\n## Core Method 3: Medium File Parquet Conversion (10k–100k Rows)\n\nFor 10k–100k rows, `pd.read_excel()` won't OOM, but Parquet is much faster for repeated analysis:\n\n```python\ndef convert_excel_to_parquet(excel_path, parquet_path, sheet_name=0):\n    \"\"\"Medium file: pd.read_excel -> Parquet cache.\"\"\"\n    if os.path.exists(parquet_path):\n        print(f\"Cache exists: {parquet_path}\")\n        return\n    df = pd.read_excel(excel_path, sheet_name=sheet_name)\n    df.columns = df.columns.astype(str)\n    df.to_parquet(parquet_path, engine='pyarrow', compression='snappy')\n    row_count = len(df)\n    del df\n    gc.collect()\n    print(f\"Converted {row_count:,} rows -> {parquet_path}\")\n```\n\n---\n\n## Core Method 4: Memory Optimization (Type Downcasting)\n\nAfter loading Parquet, further reduce memory footprint:\n\n```python\ndef optimize_dtypes(df):\n    \"\"\"Auto-downcast numeric types + convert low-cardinality strings to Category.\n    Typically saves 50-80% memory.\"\"\"\n    start_mb = df.memory_usage(deep=True).sum() / 1024**2\n\n    for col in df.select_dtypes(include=['int64', 'int32']).columns:\n        c_min, c_max = df[col].min(), df[col].max()\n        if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:\n            df[col] = df[col].astype(np.int8)\n        elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:\n            df[col] = df[col].astype(np.int16)\n        elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:\n            df[col] = df[col].astype(np.int32)\n\n    for col in df.select_dtypes(include=['float64']).columns:\n        df[col] = df[col].astype(np.float32)\n\n    for col in df.select_dtypes(include=['object', 'string']).columns:\n        if df[col].nunique() / max(len(df), 1) < 0.5:\n            df[col] = df[col].astype('category')\n\n    end_mb = df.memory_usage(deep=True).sum() / 1024**2\n    print(f\"Memory: {start_mb:.1f} MB -> {end_mb:.1f} MB (saved {(1 - end_mb/start_mb)*100:.0f}%)\")\n    return df\n```\n\n---\n\n## Core Method 5: Large File Writing\n\n```python\ndef write_large_excel(df, output_path, sheet_name=\"Sheet1\"):\n    \"\"\"Auto-select write strategy based on data size.\"\"\"\n    total_cells = len(df) * len(df.columns)\n\n    if len(df) > 1_000_000:\n        csv_path = output_path.rsplit('.', 1)[0] + '.csv'\n        df.to_csv(csv_path, index=False)\n        print(f\"Over 1M rows — exported as CSV: {csv_path}\")\n        return csv_path\n\n    if total_cells > 50_000:\n        from openpyxl import Workbook\n        from openpyxl.cell import WriteOnlyCell\n\n        wb = Workbook(write_only=True)\n        ws = wb.create_sheet(title=sheet_name)\n        ws.append(list(df.columns))\n        for idx, row in enumerate(df.itertuples(index=False)):\n            ws.append([None if pd.isna(v) else v for v in row])\n            if (idx + 1) % 100_000 == 0:\n                print(f\"  Written {idx + 1:,} rows...\")\n        wb.save(output_path)\n        wb.close()\n        print(f\"write_only mode: {len(df):,} rows -> {output_path}\")\n    else:\n        df.to_excel(output_path, index=False, sheet_name=sheet_name)\n        print(f\"Standard write: {len(df):,} rows -> {output_path}\")\n    return output_path\n```\n\n---\n\n## Example 1: 100k-Row Table — Column Distribution + Chart\n\n**Scenario**: User has a 100k-row sales Excel file and wants regional sales distribution with a bar chart.\n\n```python\nimport pandas as pd\nimport os, gc\n\nexcel_path = \"sales_100k.xlsx\"\nparquet_path = \"sales_100k.parquet\"\n\n# === Step 1: Inspect structure ===\nfile_info = inspect_excel(excel_path)\ntotal_rows = sum(m['rows'] for m in file_info.values())\nprint(f\"Total rows: {total_rows}\")\n\n# === Step 2: Choose read strategy by row count ===\nif total_rows >= 100_000:\n    stream_excel_to_parquet(excel_path, parquet_path)\nelse:\n    convert_excel_to_parquet(excel_path, parquet_path)\n\n# === Step 3: Load Parquet + optimize memory ===\ndf = pd.read_parquet(parquet_path)\ndf = optimize_dtypes(df)\nprint(f\"Shape: {df.shape}\")\nprint(df.head(3))\n\n# === Step 4: Analysis ===\nregion_sales = df.groupby('Region')['Sales'].sum().sort_values(ascending=False)\nprint(region_sales)\n\n# === Step 5: Visualization ===\nfig, ax = plt.subplots(figsize=(10, 6))\nregion_sales.plot(kind='bar', ax=ax, color='#4C72B0')\nax.set_title('Sales by Region')\nax.set_ylabel('Sales')\nplt.tight_layout()\nplt.savefig('region_sales.png', dpi=150, bbox_inches='tight')\nplt.show()\n\n# === Step 6: Cleanup ===\ndel df\ngc.collect()\n```\n\n---\n\n## Example 2: 1M-Row Table — Streaming Read + Filter + Export\n\n**Scenario**: User has a 1M-row transaction log and wants records with amount > 10,000 exported.\n\n```python\nimport pandas as pd\nimport os, gc\n\nexcel_path = \"transactions_1m.xlsx\"\nparquet_path = \"transactions_1m.parquet\"\n\n# === Step 1: Stream to Parquet (1M rows — MUST use streaming, never pd.read_excel) ===\nstream_excel_to_parquet(excel_path, parquet_path, chunk_size=50000)\n\n# === Step 2: Load only needed columns (saves memory) ===\ndf = pd.read_parquet(parquet_path, columns=['TransactionID', 'Amount', 'Date', 'Type'])\ndf = optimize_dtypes(df)\nprint(f\"Shape: {df.shape}, Memory: {df.memory_usage(deep=True).sum()/1024**2:.1f} MB\")\n\n# === Step 3: Vectorized filtering (never use apply/iterrows) ===\nmask = df['Amount'] > 10000\nhigh_value = df[mask].copy()\nprint(f\"Filtered: {len(high_value):,} / {len(df):,} rows\")\n\n# === Step 4: Export ===\noutput_path = write_large_excel(high_value, 'high_value_transactions.xlsx')\n\n# === Step 5: Cleanup ===\ndel df, high_value\ngc.collect()\n```\n\n---\n\n## Vectorized Operations Cheat Sheet\n\nOn large files, **never use slow operations** — use vectorized alternatives:\n\n| Slow (Prohibited) | Fast (Use This) |\n|-------------------|-----------------|\n| `df.apply(lambda x: x*2)` | `df['col'] * 2` |\n| `df.iterrows()` | `df.itertuples(index=False)` |\n| `for i in range(len(df)): df.iloc[i]` | Vectorized boolean indexing `df[mask]` |\n| `df['a'].map(lambda x: 'Y' if x>0 else 'N')` | `np.where(df['a']>0, 'Y', 'N')` |\n| `df.groupby('a').apply(custom_func)` | `df.groupby('a').agg({'b':'sum','c':'mean'})` |\n\n---\n\n## Memory Estimation\n\nEstimate memory before loading to avoid OOM:\n\n```\nEstimated MB ≈ rows × cols × 8 / 1024² (numeric columns)\nEstimated MB ≈ rows × cols × 50 / 1024² (with text columns)\n```\n\n| Rows | 20 cols (numeric) | 20 cols (with text) |\n|------|-------------------|---------------------|\n| 100k | ~15 MB | ~95 MB |\n| 500k | ~76 MB | ~477 MB |\n| 1M | ~153 MB | ~953 MB |\n\nWhen estimated memory exceeds 80% of available RAM, use **column-selective loading** (`pd.read_parquet(columns=[...])`) or chunked processing.\n\n---\n\n## Best Practices\n\n1. **Parquet is king**: For any file >= 10k rows, convert to Parquet before analysis. Parquet supports columnar reads, compressed storage, and loads 10-50x faster than xlsx.\n2. **Streaming is the safety net**: For 100k+ rows, always use openpyxl `read_only` + `iter_rows`. Never `pd.read_excel()` for full load.\n3. **Release memory promptly**: `del df; gc.collect()` after every intermediate DataFrame.\n4. **Excel row limit**: Max 1,048,576 rows per sheet. Auto-split to multiple sheets or export as CSV when exceeded.\n5. **Use write_only for output**: Files with >50k cells must use `openpyxl Workbook(write_only=True)`.","tags":["large","file","analysis","sensenova","skills","opensensenova","agent","agent-skills","ai-agents","ai-assistant","data-analysis","document-processing"],"capabilities":["skill","source-opensensenova","skill-sn-da-large-file-analysis","topic-agent","topic-agent-skills","topic-ai-agents","topic-ai-assistant","topic-data-analysis","topic-document-processing","topic-office-automation","topic-presentation-slides"],"categories":["SenseNova-Skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/OpenSenseNova/SenseNova-Skills/sn-da-large-file-analysis","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add OpenSenseNova/SenseNova-Skills","source_repo":"https://github.com/OpenSenseNova/SenseNova-Skills","install_from":"skills.sh"}},"qualityScore":"0.700","qualityRationale":"deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 1627 github stars · SKILL.md body (12,117 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-05-18T18:53:04.558Z","embedding":null,"createdAt":"2026-05-15T06:53:09.721Z","updatedAt":"2026-05-18T18:53:04.558Z","lastSeenAt":"2026-05-18T18:53:04.558Z","tsv":"'-50':1610 '-80':816 '/.fonts/simhei.ttf':250 '/1024':1396 '/mnt/afs_agents/mnt/data/simhei.ttf':248 '/mnt/afs_agents/simhei.ttf':247 '/usr/share/fonts/simhei.ttf':252 '/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc':251 '0':340,355,561,618,729,1008,1080,1496,1502 '0.5':931 '000':62,1002,1003,1033,1079,1203,1324 '048':127,1654 '0f':963 '1':126,277,377,930,959,1001,1007,1077,1085,1125,1167,1341,1587,1653 '10':18,61,416,1266,1323,1609 '100':962,1078,1202 '10000':1410 '100k':85,100,143,438,441,698,702,1127,1138,1551,1622 '100k-row':1126,1137 '1024':825,945,1531,1539 '10k':27,76,84,697,701,1594 '15':1552 '150':1288 '153':1562 '1f':952,956,1398 '1m':101,116,1019,1302,1314,1345,1561 '1m-row':1301,1313 '2':434,826,946,1192,1300,1365,1397,1467,1470,1615 '20':1544,1547 '3':692,1222,1242,1401,1637 '4':784,1244,1426,1648 '477':1559 '4c72b0':1274 '5':968,1260,1437,1671 '50':815,1032,1538 '50000':477,1363 '500k':1556 '50k':1679 '576':128,1655 '6':1267,1294 '76':1557 '8':1530 '80':1570 '95':1554 '953':1564 'agg':1512 'all-non':507 'altern':1457 'alway':1624 'amount':1322,1379,1409 'analysi':6,54,95,716,1245,1600 'appli':1507 'apply/iterrows':1406 'as-i':218 'ascend':1254 'astyp':862,880,898,912,936 'auto':802,984,1660 'auto-downcast':801 'auto-select':983 'auto-split':1659 'avail':1572 'avoid':96,494,1524 'ax':1263,1271,1272 'ax.set':1275,1280 'axes.unicode':272 'b':1513 'bar':1150,1270 'base':988 'bbox':1289 'best':1585 'big':48 'block':217 'boolean':1484 'break':270 'c':358,360,370,621,623,633,836,838,847,853,865,871,883,889,1515 'cach':735,742 'cardin':809 'cast':490 'categori':812,937 'caus':513 'cell':993,1031,1680 'chart':1132,1151 'cheat':1446 'choos':1193 'chunk':133,475,497,512,557,643,645,648,653,661,665,668,673,1361,1583 'chunk_rows.append':638 'cjk':205 'cleanup':1295,1438 'col':366,410,569,590,629,828,841,844,859,861,877,879,895,897,901,909,911,915,925,933,935,1469,1529,1537,1545,1548 'color':1273 'column':318,383,409,413,415,488,504,524,835,907,922,1130,1369,1377,1533,1542,1576,1581 'column-select':1575 'columnar':1603 'compress':765,1605 'constant':484 'continu':637 'convers':696 'convert':89,522,719,776,806,1213,1596 'copi':215,1415 'core':275,432,690,782,966 'count':295,339,376,382,768,778,1198 'cross':496 'cross-chunk':495 'csv':1004,1009,1011,1012,1023,1024,1027,1668 'custom':1508 'da':3,31 'data':49,71,283,298,327,541,990 'datafram':169,1647 'date':1380 'deep':822,942,1393 'def':304,463,562,718,797,973 'del':771,1296,1439,1641 'df':747,770,772,800,840,843,858,860,876,878,894,896,908,910,924,929,932,934,965,977,995,1000,1097,1117,1227,1232,1235,1297,1372,1382,1385,1408,1413,1423,1440,1468,1480,1486,1488,1500,1642 'df.apply':178,1463 'df.columns':756,997,1055 'df.columns.astype':757 'df.groupby':1248,1505,1510 'df.head':1241 'df.iloc':1481 'df.iterrows':166,1471 'df.itertuples':1061,1472 'df.memory':820,940,1391 'df.select':830,903,917 'df.shape':1239,1389 'df.to':759,1010,1102 'direct':80 'distribut':1131,1147 'done':681 'downcast':788,803 'dpi':1287 'dtype':799,831,904,918,1234,1384 'elif':864,882 'els':364,374,551,580,627,1069,1101,1212,1497 'end':938,954,960 'engin':763 'enumer':347,372,592,610,635,1060 'environ':186 'estim':1518,1519,1526,1534,1567 'everi':1645 'exampl':1124,1299 'exceed':1569,1670 'excel':8,32,53,79,88,111,124,139,306,312,392,447,465,468,479,500,536,705,720,723,733,749,750,976,1103,1141,1160,1173,1174,1205,1208,1214,1217,1334,1352,1354,1357,1432,1633,1649 'exist':743 'export':1021,1308,1325,1427,1666 'f':365,401,412,427,628,656,680,741,775,948,1017,1082,1092,1113,1186,1237,1387,1417 'fals':274,1015,1063,1107,1255,1474 'fast':1460 'faster':713,1612 'fc':153,225 'fc-list':152,224 'fig':1262 'figsiz':1265 'file':5,47,145,279,290,307,322,389,393,443,694,731,970,1142,1170,1450,1593,1677 'file_info.items':399 'file_info.values':425,1184 'filter':1307,1403,1418 'find':155,227 'fix':208 'float64':906 'flush':563,647,667 'fm':244 'fm.fontmanager.addfont':261 'fm.fontproperties':265 'fname':266 'font':150,156,206,214,233,245,256 'font.family':264 'footprint':795 'full':1635 'fulli':141 'func':1509 'gc':200,1159,1333 'gc.collect':663,773,1298,1443,1643 'get':268 'glob':230 'header':341,356,384,553,593,619 'high':1411,1420,1433,1441 'high_value_transactions.xlsx':1435 'idx':573,576,589,1057,1076,1084 'import':189,193,197,199,234,236,240,302,453,455,459,1036,1040,1153,1157,1327,1331 'inch':1290 'includ':832,905,919 'index':1014,1062,1106,1473,1485 'infer':516 'info':330,378,387,390,1171 'inspect':278,291,305,311,391,1168,1172 'instal':158,162 'instead':519 'int32':834 'int64':833 'intermedi':1646 'iter':15,106,1629 'itertupl':171 'j':367,369,630,632 'kind':1269 'king':1590 'lambda':179,1464,1491 'larg':4,46,51,168,289,969,975,1431,1449 'layout':1284 'len':407,642,652,672,769,928,994,996,999,1096,1116,1419,1422,1479 'limit':1651 'list':154,226,639,1054 'load':142,282,297,526,790,1223,1366,1522,1578,1608,1636 'locat':232 'log':1317 'low':808 'low-cardin':807 'm':420,423,1179,1182 'manag':242 'mandatori':56 'map':1490 'mask':1407,1414,1487 'matplotlib':235 'matplotlib.font':241 'matplotlib.pyplot':237 'matplotlib.rcparams':263,271 'max':125,839,845,854,857,872,875,890,893,927,1652 'may':505 'mb':819,939,951,953,955,957,1399,1527,1535,1553,1555,1558,1560,1563,1565 'mb/start_mb':961 'mean':1516 'medium':693,730 'memori':82,300,485,785,794,817,949,1226,1371,1390,1517,1520,1568,1639 'meta':397,404,408,414 'method':67,276,433,691,783,967 'min':837,842,848,851,866,869,884,887 'minus':273 'mismatch':499 'mix':502 'mixed-typ':501 'mode':1095 'much':712 'multi':121 'multi-sheet':120 'multipl':1663 'must':64,132,1347,1681 'n':1498,1504 'name':269,316,332,337,379,473,547,550,728,753,755,981,1052,1109,1111 'need':532,1368 'net':1620 'never':444,1350,1404,1451,1631 'none':342,363,474,509,554,556,579,581,597,626,1065 'nonloc':565 'np':196 'np.float32':913 'np.iinfo':849,855,867,873,885,891 'np.int16':868,874,881 'np.int32':886,892,899 'np.int8':850,856,863 'np.where':1499 'null':517 'numer':523,530,804,1532,1546 'numpi':194 'nuniqu':926 'object':920 'oom':40,113,708,1525 'op':174 'openpyxl':11,103,303,449,454,1035,1626,1683 'openpyxl.cell':1039 'openpyxl.load':320,534 'oper':181,286,1445,1454 'optim':786,798,1225,1233,1383 'os':198,1158,1332 'os.path.exists':259,737 'os.path.expanduser':249 'output':978,1088,1099,1104,1119,1122,1428,1676 'output_path.rsplit':1006 'p':254,260,262,267 'pa':458 'pa.array':570 'pa.string':587 'pa.table':568 'packag':159 'panda':190,1154,1328 'parquet':20,44,91,93,109,437,451,467,470,482,527,600,685,695,710,722,725,734,738,744,760,761,780,791,1163,1207,1210,1216,1219,1224,1229,1230,1337,1344,1356,1359,1374,1375,1580,1588,1598,1601 'path':209,246,257,308,323,394,469,471,537,601,686,724,726,739,745,751,762,781,979,1005,1013,1025,1028,1089,1100,1105,1120,1123,1161,1164,1175,1209,1211,1218,1220,1231,1335,1338,1358,1360,1376,1429 'pd':192,1156,1330 'pd.isna':1067 'pd.options.mode.copy':201 'pd.read':78,87,92,110,138,446,704,732,748,1228,1351,1373,1579,1632 'pd.to':529 'per':130,1657 'pip':161 'plt':239 'plt.savefig':1285 'plt.show':1292 'plt.subplots':1264 'plt.tight':1283 'pq':462 'pq.parquetwriter':599 'practic':1586 'pressur':83 'print':400,411,426,655,679,740,774,947,1016,1081,1091,1112,1185,1236,1240,1256,1386,1416 'process':1584 'prohibit':134,1459 'prompt':1640 'pyarrow':456,514,764 'pyarrow.parquet':460 'python':188,301,452,717,796,972,1152,1326 'r':572,575,583 'ram':1573 'rang':1478 'read':12,73,99,104,119,324,436,538,1194,1306,1604,1627 'reason':75 'record':1320 'reduc':793 'region':1145,1246,1249,1257,1279 'region_sales.plot':1268 'region_sales.png':1286 'releas':1638 'repeat':97,715 'return':314,386,687,746,964,1026,1121 'row':16,60,77,86,102,107,117,129,144,294,317,338,345,349,373,375,380,381,405,406,418,421,429,431,439,442,480,558,564,585,608,612,636,640,644,649,654,660,662,666,669,674,684,699,703,767,777,779,1020,1058,1074,1086,1098,1118,1128,1139,1177,1180,1188,1190,1197,1201,1303,1315,1346,1424,1528,1536,1543,1595,1623,1630,1650,1656 'rule':57 'safeti':1619 'sale':1140,1146,1247,1250,1258,1277,1282 'sales_100k.parquet':1165 'sales_100k.xlsx':1162 'save':814,958,1370 'scale':52,72 'scenario':1133,1309 'schema':498 'search':148,212 'select':985,1577 'setup':187,207 'shape':1238,1388 'sheet':122,131,292,315,396,402,403,472,546,549,727,752,754,980,1049,1051,1108,1110,1447,1658,1664 'sheet1':982 'size':476,646,991,1362 'skill':25,55,70 'skill-sn-da-large-file-analysis' 'slow':98,1453,1458 'sn':2,30 'sn-da-excel-workflow':29 'sn-da-large-file-analysi':1 'snappi':766 'sort':1252 'source-opensensenova' 'split':123,1661 'standard':1114 'start':818,950 'step':1166,1191,1221,1243,1259,1293,1340,1364,1400,1425,1436 'storag':1606 'str':357,571,620,758 'strategi':74,987,1195 'stre':50 'stream':108,118,310,435,450,464,478,1204,1305,1342,1349,1353,1616 'stream-inspect':309 'string':492,521,810,921 'structur':280,313,1169 'subprocess':228 'sum':419,824,944,1178,1251,1395,1514 'support':1602 'tabl':567,604,605,1129,1304 'table.schema':602 'text':1541,1550 'tight':1291 'timeout':115 'titl':1050,1276 'topic-agent' 'topic-agent-skills' 'topic-ai-agents' 'topic-ai-assistant' 'topic-data-analysis' 'topic-document-processing' 'topic-office-automation' 'topic-presentation-slides' 'total':59,417,428,430,559,650,658,670,682,688,992,1030,1176,1187,1189,1200 'transact':1316 'transactionid':1378 'transactions_1m.parquet':1339 'transactions_1m.xlsx':1336 'true':204,326,329,352,540,543,615,823,943,1046,1394,1687 'type':503,518,586,787,805,1381 'typic':813 'usag':388,486,821,941,1392 'use':65,137,165,170,177,223,445,448,1348,1405,1452,1455,1461,1574,1625,1672,1682 'user':1134,1310 'v':1068,1070,1072 'valu':350,613,1253,1412,1421,1434,1442 'vector':173,185,1402,1444,1456,1483 'visual':1261 'want':1144,1319 'wb':319,336,533,545,1042 'wb.active':552 'wb.close':385,678,1090 'wb.create':1048 'wb.save':1087 'wb.sheetnames':334 'without':281,296 'won':706 'workbook':321,535,1037,1043,1684 'workflow':33 'write':203,971,974,986,1044,1093,1115,1430,1673,1685 'writeonlycel':1041 'writer':555,566,595,598,676 'writer.close':677 'writer.write':603 'written':560,651,657,659,671,683,689,1083 'ws':335,544,1047 'ws.append':1053,1064 'ws.iter':348,611 'x':1465,1466,1492,1495,1611 'xlsx':1614 'y':1493,1503 'ylabel':1281 '万行以上':7,19 '内存不足':39 '内存优化':22 '分块处理':45 '分块处理和大文件写入模式':23 '十万行':42 '大数据量':37 '大文件':36 '性能优化':38 '提供':10 '支持':17 '数据行数':26 '数据集的高性能分析引擎':9 '流式读取':14,43 '用户出现触发词':35 '由':28 '百万行':41 '的行数评估步骤触发':34 '转换加速':21 '遇到以下任一情况就主动使用本':24","prices":[{"id":"07292170-b56e-4e74-9f41-4fb57e2210e9","listingId":"67468c63-91b7-4da6-9844-de2441f142a4","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"OpenSenseNova","category":"SenseNova-Skills","install_from":"skills.sh"},"createdAt":"2026-05-15T06:53:09.721Z"}],"sources":[{"listingId":"67468c63-91b7-4da6-9844-de2441f142a4","source":"github","sourceId":"OpenSenseNova/SenseNova-Skills/sn-da-large-file-analysis","sourceUrl":"https://github.com/OpenSenseNova/SenseNova-Skills/tree/main/skills/sn-da-large-file-analysis","isPrimary":false,"firstSeenAt":"2026-05-15T06:53:09.721Z","lastSeenAt":"2026-05-18T18:53:04.558Z"}],"details":{"listingId":"67468c63-91b7-4da6-9844-de2441f142a4","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"OpenSenseNova","slug":"sn-da-large-file-analysis","github":{"repo":"OpenSenseNova/SenseNova-Skills","stars":1627,"topics":["agent","agent-skills","ai-agents","ai-assistant","data-analysis","document-processing","office-automation","presentation-slides"],"license":"mit","html_url":"https://github.com/OpenSenseNova/SenseNova-Skills","pushed_at":"2026-05-15T04:43:37Z","description":"Modular SenseNova skills for building AI-powered office assistants and productivity workflows","skill_md_sha":"f6417b991bf0a2a156d8df10875dd4481e80506b","skill_md_path":"skills/sn-da-large-file-analysis/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/OpenSenseNova/SenseNova-Skills/tree/main/skills/sn-da-large-file-analysis"},"layout":"multi","source":"github","category":"SenseNova-Skills","frontmatter":{"name":"sn-da-large-file-analysis","description":"万行以上 Excel 数据集的高性能分析引擎。提供 openpyxl read_only 流式读取（iter_rows 支持 10 万行以上）、Parquet 转换加速、内存优化、分块处理和大文件写入模式。**遇到以下任一情况就主动使用本 skill**：①数据行数 ≥ 10k（由 sn-da-excel-workflow 的行数评估步骤触发）；②用户出现触发词：大文件 / 大数据量 / 性能优化 / 内存不足 / OOM / 百万行 / 十万行 / 流式读取 / Parquet / 分块处理 / large file / big data / streaming read / chunked processing；③直接使用 pd.read_excel() 导致超时或内存溢出；④用户明确要求对大规模数据集进行高性能处理。仅不用于：小于 10k 行的常规 Excel 分析（使用 sn-da-excel-workflow 即可）。"},"skills_sh_url":"https://skills.sh/OpenSenseNova/SenseNova-Skills/sn-da-large-file-analysis"},"updatedAt":"2026-05-18T18:53:04.558Z"}}