Skillquality 0.70

sn-da-excel-workflow

Excel 数据分析多步编排器。覆盖:(1) 读取多 Sheet Excel 文件并统计行数,(2) 大文件检测(≥10k 行自动 Parquet 优化),(3) 数据清洗(缺失值、文本标准化、无效字符),(4) 条件筛选与分类提取,(5) 跨 Sheet 统计聚合,(6) 导出 Excel/CSV 并提供下载链接。覆盖从数据读取到报告生成全流程,按步骤编排 capability 子 skill。**遇到以下任一情况就主动使用本 skill,不要自行写几行 pandas 就回答**:①用户出现触发词:Excel 分析 / 表格分析 / 数据分析 / 数据

Price
free
Protocol
skill
Verified
no

What it does

Excel Data Analysis Workflow

End-to-end workflow for structured Excel analysis. Each step maps to a capability sub-skill that can be loaded for detailed patterns.

Workflow

Step 1 — Count rows across all sheets (lightweight, no full load)

Count rows per sheet without loading data into memory. Use openpyxl read_only mode — this works for any file size.

import openpyxl, gc

wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
total_rows = 0
sheet_info = {}
for name in wb.sheetnames:
    ws = wb[name]
    row_count = sum(1 for _ in ws.iter_rows(min_row=2, values_only=True))
    total_rows += row_count
    sheet_info[name] = row_count
    print(f"Sheet '{name}': {row_count} rows")
wb.close()
print(f"总行数={total_rows}")

⚠️ Do NOT use pd.read_excel() to count rows — it loads all data into memory, which will OOM on large files.

→ capability: excel-reading/multi-sheet-reading

Step 2 — Large file gate (CRITICAL — choose strategy by row count)

total_rowsStrategyWhat to do
< 10kDirect readdf = pd.read_excel(file_path, sheet_name=target_sheet)
10k – 100kParquet cachepd.read_excel() once → df.to_parquet() → all later reads from Parquet
>= 100kSTOP. Load sn-da-large-file-analysis skillRead its SKILL.md, then follow its streaming read + Parquet pattern. Do NOT use pd.read_excel() at all — it will OOM or timeout on 100k+ rows.

For >= 100k rows:

read_file(path="<skills_base>/sn-da-large-file-analysis/SKILL.md")

Then use stream_excel_to_parquet() from that skill — it reads via openpyxl iter_rows in 50k-row chunks with constant memory.

For 10k – 100k rows (only):

import pandas as pd
parquet_path = "/tmp/_auto_parquet.parquet"
df = pd.read_excel(file_path, sheet_name=target_sheet)
df.to_parquet(parquet_path, engine="pyarrow")
del df; gc.collect()
df = pd.read_parquet(parquet_path)

→ capability: excel-reading/large-excel-reading

Step 3 — Inspect schema & data types

Preview target sheet structure. For large files (>= 10k rows), only read a small sample — never full load just to inspect.

# For any file size — read only first N rows for inspection
df_head = pd.read_excel(file_path, sheet_name=target_sheet, nrows=20)
print(f"Columns: {df_head.columns.tolist()}")
print(f"Dtypes:\n{df_head.dtypes}")
print(df_head.head(10))

→ capability: excel-reading/range-reading

Step 4 — Data cleaning

Handle missing values, normalize text, clean invalid characters.

# Missing values
null_count = df[col].isna().sum()

# Text cleaning: keep only Chinese characters
import re
def clean_text(val):
    if pd.isna(val): return val
    return "".join(re.findall(r"[\u4e00-\u9fff]", str(val))) or ""

df[col] = df[col].apply(clean_text)

⚠️ Large file rule: When total_rows >= 100k, do NOT use df.apply(lambda...). Use vectorized operations or np.where() instead. See sn-da-large-file-analysis skill for the vectorized cheat sheet.

→ capabilities:

  • excel-data-cleaning/missing-value-handling
  • excel-data-cleaning/invalid-data-cleaning
  • excel-data-cleaning/text-normalization

Step 5 — Filter & extract

Apply condition or category filters, aggregate results.

# Condition filter
mask = df[col].astype(str).str.strip() == target_value
filtered = df[mask]

# Category extraction (for headerless layouts)
df_raw = pd.read_excel(file_path, sheet_name=sheet, header=None)
# Walk rows to find category markers, collect items until next marker

→ capabilities:

  • excel-data-filtering/condition-filtering
  • excel-data-filtering/category-filtering
  • excel-data-filtering/threshold-filtering

Step 6 — Export results

Save filtered/cleaned data as Excel or CSV. Provide download link.

output_path = "/mnt/data/result.xlsx"
result_df.to_excel(output_path, index=False)
print(f"[Download](sandbox:{output_path})")

→ capabilities:

  • excel-result-export/single-sheet-export
  • excel-result-export/formatted-export

Key rules

  • Always count rows first — gate large-file logic on the 10k threshold.
  • >= 100k rows → MUST load sn-da-large-file-analysis skill — do not attempt to handle with pd.read_excel().
  • Column names may contain spaces (e.g. '是否通 过') — use exact string indexing.
  • Headerless sheets — use header=None and positional indexing.
  • Prohibited on large files (>= 100k rows):
    • pd.read_excel() for full load (use streaming read → Parquet)
    • df.apply(lambda...) or df.iterrows() (use vectorized ops or itertuples())
    • fc-list, find ... fonts, subprocess to search fonts, or pip install (use fixed font paths below)
    • Printing all unique values or full DataFrames (use .head(), .value_counts().head())

CJK Font Setup (mandatory for charts)

When generating charts with matplotlib, copy this block as-is. Do NOT search for fonts.

import os
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

_FONT_PATHS = [
    '/mnt/afs_agents/SimHei.ttf',
    '/mnt/afs_agents/mnt/data/SimHei.ttf',
    os.path.expanduser('~/.fonts/SimHei.ttf'),
    '/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc',
    '/usr/share/fonts/SimHei.ttf',
]
for _p in _FONT_PATHS:
    if os.path.exists(_p):
        fm.fontManager.addfont(_p)
        matplotlib.rcParams['font.family'] = fm.FontProperties(fname=_p).get_name()
        break
matplotlib.rcParams['axes.unicode_minus'] = False

How to load sub-skills

Each workflow step references one or more capability sub-skills. When you need the detailed code pattern for a step, load the sub-skill on demand:

read_file(path="<base_path>/<category>/<sub-skill-name>/SKILL.md")

Rules:

  • Only load the sub-skill(s) needed for your current step.
  • Do NOT load all sub-skills at once — it wastes context.
  • The top-level workflow (this file) is your guide; sub-skills provide detailed implementation patterns.

Available capability sub-skills

Base path: <skills_root>/sn-da-excel-workflow/capability/{category}/{sub-skill}/SKILL.md

excel-reading — 读取与解析

Sub-skill功能
single-sheet-reading读取单个工作表,支持合并单元格处理、交叉分析及多维度可视化
multi-sheet-reading读取多工作表,动态评估数据量启用Parquet优化,支持正则清洗、分类汇总与线性拟合
range-reading特定区域数据提取,根据数据量动态选择处理策略
large-excel-reading大型Excel文件处理,支持Parquet转换提速,生成带条件高亮的格式化报告
multi-file-reading多文件读取与统计,支持大文件Parquet转换与可视化报告
specific-sheet-reading跨Sheet特定字段统计、数据清洗与交叉分析,生成带下载链接的汇总报告
structured-header-reading动态识别目标列进行统计,正则清洗文本字段提取中文字符

excel-data-cleaning — 数据清洗

Sub-skill功能
missing-value-handling多Sheet智能清洗、跨表核对与可视化分析
duplicate-removal多Sheet去重统计,生成摘要与明细报表
invalid-data-cleaning正则清洗指定文本列(如保留中文字符),大文件自动Parquet加速
text-normalization文本标准化清洗(去除异常前缀、提取纯中文字符等)
numeric-format-normalization数值格式标准化,支持关键指标合计核对与结果文件导出
outlier-detectionIQR异常值检测,结合偏度/峰度分析数据分布,适用于非正态数据预处理

excel-data-filtering — 数据筛选

Sub-skill功能
condition-filtering根据数据规模动态选择处理策略进行条件筛选
category-filtering自定义分类统计、交叉分析,支持文本长度/术语密度/正则匹配等综合评分与分级
range-filtering根据多维数值条件筛选并导出,支持大规模数据自动性能优化
threshold-filtering数值列清洗、条件过滤,使用openpyxl对符合条件的单元格进行样式标记

excel-data-analysis — 数据分析

Sub-skill功能
comparison-analysis两类分类数据对比分析,统计数量差异与比例关系并生成可视化
group-by-analysis多Sheet数据清洗及分组聚合分析,生成带样式标记的统计表与图表
kpi-metric-analysis提取关键指标进行单位一致性验证与排序分析
pivot-table-analysis交叉表与热力图进行多维度占比分析,适用于奖项分布/绩效评估/市场占有率
time-series-analysis时间序列趋势分析、百分比清洗、绩效分级建模与预测,生成高分辨率可视化报告
trend-analysis多维度分级评估与趋势预测,差异化增长率计算,适用于绩效评估/目标设定

excel-data-statistics — 统计计算

Sub-skill功能
basic-statistics基础统计,支持按条件筛选计算均值,指定行区间提取数据去重求和
category-statistics各类别数量与占比统计,生成柱状图/饼图等组合可视化报告
grouped-statistics多Sheet数据合并与前向填充,分组统计
percentage-calculation逐行扫描或列匹配提取关键指标并计算占比/均值,输出结构化报告及图表

excel-data-visualization — 数据可视化

Sub-skill功能
bar-chart-visualization处理合并单元格,交叉分组统计,生成支持中英文字体的美化柱状图
histogram-visualization数值型分布分析与异常值检测,支持正则提取误差项,生成箱线图与直方图
line-chart-visualization特征清洗与聚类分析,生成趋势对比/分布特征/参数敏感性多维度图表
pie-chart-visualization分类汇总统计,自动识别关键字段生成包含占比/数值的美化饼图
scatter-plot-visualization多维度统计分析与散点图可视化
stacked-chart-visualization百分比字符串数据处理,补全缺失维度,生成堆叠柱状图展示构成变化趋势

excel-cell-coloring — 单元格着色

Sub-skill功能
category-coloring提取目标指标计算最大值,对特定行进行高亮标注
duplicate-value-coloring对比多表中的特定系数并对异常值进行颜色标记
outlier-coloring识别超限数值与错误单元格并进行高亮标注
threshold-cell-coloring计算时间序列平均值,使用openpyxl输出带条件格式(如低于均值标绿)的报告
top-value-coloring根据数据规模动态选择策略,多表合并、统计筛选,关键指标自动化样式高亮

excel-conditional-formatting — 条件格式

Sub-skill功能
data-bar-formatting从带单位字符串列提取数值并清洗,生成直方图/饼图/条形图/累积分布图

excel-result-export — 结果导出

Sub-skill功能
single-sheet-export多Sheet数据探查与条件过滤导出,重命名字段后生成带下载链接的Excel
formatted-export条件筛选记录并以整行标红格式导出Excel
chart-embedded-export分类分布清洗与统计,生成多维度交叉分析与高分辨率嵌入式图表报告
report-generation-export从Excel提取多类型数据,生成包含可视化图表与下载链接的综合分析报告

excel-table-styling — 表格样式

Sub-skill功能
table-theme-styling大文件Parquet加速读取,条件筛选/分类汇总与结果导出

Capabilities

skillsource-opensensenovaskill-sn-da-excel-workflowtopic-agenttopic-agent-skillstopic-ai-agentstopic-ai-assistanttopic-data-analysistopic-document-processingtopic-office-automationtopic-presentation-slides

Install

Quality

0.70/ 1.00

deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 1627 github stars · SKILL.md body (9,366 chars)

Provenance

Indexed fromgithub
Enriched2026-05-18 18:53:04Z · deterministic:skill-github:v1 · v1
First seen2026-05-15
Last seen2026-05-18

Agent access