Skillquality 0.70

sn-da-excel-workflow

Excel 数据分析多步编排器。覆盖：(1) 读取多 Sheet Excel 文件并统计行数，(2) 大文件检测（≥10k 行自动 Parquet 优化），(3) 数据清洗（缺失值、文本标准化、无效字符），(4) 条件筛选与分类提取，(5) 跨 Sheet 统计聚合，(6) 导出 Excel/CSV 并提供下载链接。覆盖从数据读取到报告生成全流程，按步骤编排 capability 子 skill。**遇到以下任一情况就主动使用本 skill，不要自行写几行 pandas 就回答**：①用户出现触发词：Excel 分析 / 表格分析 / 数据分析 / 数据

Price

free

Protocol

skill

Verified

Endpoint

https://skills.sh/OpenSenseNova/SenseNova-Skills/sn-da-excel-workflow

What it does

Excel Data Analysis Workflow

End-to-end workflow for structured Excel analysis. Each step maps to a capability sub-skill that can be loaded for detailed patterns.

Workflow

Step 1 — Count rows across all sheets (lightweight, no full load)

Count rows per sheet without loading data into memory. Use openpyxl read_only mode — this works for any file size.

import openpyxl, gc

wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
total_rows = 0
sheet_info = {}
for name in wb.sheetnames:
    ws = wb[name]
    row_count = sum(1 for _ in ws.iter_rows(min_row=2, values_only=True))
    total_rows += row_count
    sheet_info[name] = row_count
    print(f"Sheet '{name}': {row_count} rows")
wb.close()
print(f"总行数={total_rows}")

⚠️ Do NOT use pd.read_excel() to count rows — it loads all data into memory, which will OOM on large files.

→ capability: excel-reading/multi-sheet-reading

Step 2 — Large file gate (CRITICAL — choose strategy by row count)

total_rows	Strategy	What to do
< 10k	Direct read	`df = pd.read_excel(file_path, sheet_name=target_sheet)`
10k – 100k	Parquet cache	`pd.read_excel()` once → `df.to_parquet()` → all later reads from Parquet
>= 100k	STOP. Load `sn-da-large-file-analysis` skill	Read its SKILL.md, then follow its streaming read + Parquet pattern. Do NOT use `pd.read_excel()` at all — it will OOM or timeout on 100k+ rows.

For >= 100k rows:

read_file(path="<skills_base>/sn-da-large-file-analysis/SKILL.md")

Then use stream_excel_to_parquet() from that skill — it reads via openpyxl iter_rows in 50k-row chunks with constant memory.

For 10k – 100k rows (only):

import pandas as pd
parquet_path = "/tmp/_auto_parquet.parquet"
df = pd.read_excel(file_path, sheet_name=target_sheet)
df.to_parquet(parquet_path, engine="pyarrow")
del df; gc.collect()
df = pd.read_parquet(parquet_path)

→ capability: excel-reading/large-excel-reading

Step 3 — Inspect schema & data types

Preview target sheet structure. For large files (>= 10k rows), only read a small sample — never full load just to inspect.

# For any file size — read only first N rows for inspection
df_head = pd.read_excel(file_path, sheet_name=target_sheet, nrows=20)
print(f"Columns: {df_head.columns.tolist()}")
print(f"Dtypes:\n{df_head.dtypes}")
print(df_head.head(10))

→ capability: excel-reading/range-reading

Step 4 — Data cleaning

Handle missing values, normalize text, clean invalid characters.

# Missing values
null_count = df[col].isna().sum()

# Text cleaning: keep only Chinese characters
import re
def clean_text(val):
    if pd.isna(val): return val
    return "".join(re.findall(r"[\u4e00-\u9fff]", str(val))) or ""

df[col] = df[col].apply(clean_text)

⚠️ Large file rule: When total_rows >= 100k, do NOT use df.apply(lambda...). Use vectorized operations or np.where() instead. See sn-da-large-file-analysis skill for the vectorized cheat sheet.

→ capabilities:

excel-data-cleaning/missing-value-handling
excel-data-cleaning/invalid-data-cleaning
excel-data-cleaning/text-normalization

Step 5 — Filter & extract

Apply condition or category filters, aggregate results.

# Condition filter
mask = df[col].astype(str).str.strip() == target_value
filtered = df[mask]

# Category extraction (for headerless layouts)
df_raw = pd.read_excel(file_path, sheet_name=sheet, header=None)
# Walk rows to find category markers, collect items until next marker

→ capabilities:

excel-data-filtering/condition-filtering
excel-data-filtering/category-filtering
excel-data-filtering/threshold-filtering

Step 6 — Export results

Save filtered/cleaned data as Excel or CSV. Provide download link.

output_path = "/mnt/data/result.xlsx"
result_df.to_excel(output_path, index=False)
print(f"[Download](sandbox:{output_path})")

→ capabilities:

excel-result-export/single-sheet-export
excel-result-export/formatted-export

Key rules

Always count rows first — gate large-file logic on the 10k threshold.
>= 100k rows → MUST load sn-da-large-file-analysis skill — do not attempt to handle with pd.read_excel().
Column names may contain spaces (e.g. '是否通过') — use exact string indexing.
Headerless sheets — use header=None and positional indexing.
Prohibited on large files (>= 100k rows):
- pd.read_excel() for full load (use streaming read → Parquet)
- df.apply(lambda...) or df.iterrows() (use vectorized ops or itertuples())
- fc-list, find ... fonts, subprocess to search fonts, or pip install (use fixed font paths below)
- Printing all unique values or full DataFrames (use .head(), .value_counts().head())

CJK Font Setup (mandatory for charts)

When generating charts with matplotlib, copy this block as-is. Do NOT search for fonts.

import os
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

_FONT_PATHS = [
    '/mnt/afs_agents/SimHei.ttf',
    '/mnt/afs_agents/mnt/data/SimHei.ttf',
    os.path.expanduser('~/.fonts/SimHei.ttf'),
    '/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc',
    '/usr/share/fonts/SimHei.ttf',
]
for _p in _FONT_PATHS:
    if os.path.exists(_p):
        fm.fontManager.addfont(_p)
        matplotlib.rcParams['font.family'] = fm.FontProperties(fname=_p).get_name()
        break
matplotlib.rcParams['axes.unicode_minus'] = False

How to load sub-skills

Each workflow step references one or more capability sub-skills. When you need the detailed code pattern for a step, load the sub-skill on demand:

read_file(path="<base_path>/<category>/<sub-skill-name>/SKILL.md")

Rules:

Only load the sub-skill(s) needed for your current step.
Do NOT load all sub-skills at once — it wastes context.
The top-level workflow (this file) is your guide; sub-skills provide detailed implementation patterns.

Available capability sub-skills

Base path: <skills_root>/sn-da-excel-workflow/capability/{category}/{sub-skill}/SKILL.md

excel-reading — 读取与解析

Sub-skill	功能
single-sheet-reading	读取单个工作表，支持合并单元格处理、交叉分析及多维度可视化
multi-sheet-reading	读取多工作表，动态评估数据量启用Parquet优化，支持正则清洗、分类汇总与线性拟合
range-reading	特定区域数据提取，根据数据量动态选择处理策略
large-excel-reading	大型Excel文件处理，支持Parquet转换提速，生成带条件高亮的格式化报告
multi-file-reading	多文件读取与统计，支持大文件Parquet转换与可视化报告
specific-sheet-reading	跨Sheet特定字段统计、数据清洗与交叉分析，生成带下载链接的汇总报告
structured-header-reading	动态识别目标列进行统计，正则清洗文本字段提取中文字符

excel-data-cleaning — 数据清洗

Sub-skill	功能
missing-value-handling	多Sheet智能清洗、跨表核对与可视化分析
duplicate-removal	多Sheet去重统计，生成摘要与明细报表
invalid-data-cleaning	正则清洗指定文本列（如保留中文字符），大文件自动Parquet加速
text-normalization	文本标准化清洗（去除异常前缀、提取纯中文字符等）
numeric-format-normalization	数值格式标准化，支持关键指标合计核对与结果文件导出
outlier-detection	IQR异常值检测，结合偏度/峰度分析数据分布，适用于非正态数据预处理

excel-data-filtering — 数据筛选

Sub-skill	功能
condition-filtering	根据数据规模动态选择处理策略进行条件筛选
category-filtering	自定义分类统计、交叉分析，支持文本长度/术语密度/正则匹配等综合评分与分级
range-filtering	根据多维数值条件筛选并导出，支持大规模数据自动性能优化
threshold-filtering	数值列清洗、条件过滤，使用openpyxl对符合条件的单元格进行样式标记

excel-data-analysis — 数据分析

Sub-skill	功能
comparison-analysis	两类分类数据对比分析，统计数量差异与比例关系并生成可视化
group-by-analysis	多Sheet数据清洗及分组聚合分析，生成带样式标记的统计表与图表
kpi-metric-analysis	提取关键指标进行单位一致性验证与排序分析
pivot-table-analysis	交叉表与热力图进行多维度占比分析，适用于奖项分布/绩效评估/市场占有率
time-series-analysis	时间序列趋势分析、百分比清洗、绩效分级建模与预测，生成高分辨率可视化报告
trend-analysis	多维度分级评估与趋势预测，差异化增长率计算，适用于绩效评估/目标设定

excel-data-statistics — 统计计算

Sub-skill	功能
basic-statistics	基础统计，支持按条件筛选计算均值，指定行区间提取数据去重求和
category-statistics	各类别数量与占比统计，生成柱状图/饼图等组合可视化报告
grouped-statistics	多Sheet数据合并与前向填充，分组统计
percentage-calculation	逐行扫描或列匹配提取关键指标并计算占比/均值，输出结构化报告及图表

excel-data-visualization — 数据可视化

Sub-skill	功能
bar-chart-visualization	处理合并单元格，交叉分组统计，生成支持中英文字体的美化柱状图
histogram-visualization	数值型分布分析与异常值检测，支持正则提取误差项，生成箱线图与直方图
line-chart-visualization	特征清洗与聚类分析，生成趋势对比/分布特征/参数敏感性多维度图表
pie-chart-visualization	分类汇总统计，自动识别关键字段生成包含占比/数值的美化饼图
scatter-plot-visualization	多维度统计分析与散点图可视化
stacked-chart-visualization	百分比字符串数据处理，补全缺失维度，生成堆叠柱状图展示构成变化趋势

excel-cell-coloring — 单元格着色

Sub-skill	功能
category-coloring	提取目标指标计算最大值，对特定行进行高亮标注
duplicate-value-coloring	对比多表中的特定系数并对异常值进行颜色标记
outlier-coloring	识别超限数值与错误单元格并进行高亮标注
threshold-cell-coloring	计算时间序列平均值，使用openpyxl输出带条件格式（如低于均值标绿）的报告
top-value-coloring	根据数据规模动态选择策略，多表合并、统计筛选，关键指标自动化样式高亮

excel-conditional-formatting — 条件格式

Sub-skill	功能
data-bar-formatting	从带单位字符串列提取数值并清洗，生成直方图/饼图/条形图/累积分布图

excel-result-export — 结果导出

Sub-skill	功能
single-sheet-export	多Sheet数据探查与条件过滤导出，重命名字段后生成带下载链接的Excel
formatted-export	条件筛选记录并以整行标红格式导出Excel
chart-embedded-export	分类分布清洗与统计，生成多维度交叉分析与高分辨率嵌入式图表报告
report-generation-export	从Excel提取多类型数据，生成包含可视化图表与下载链接的综合分析报告

excel-table-styling — 表格样式

Sub-skill	功能
table-theme-styling	大文件Parquet加速读取，条件筛选/分类汇总与结果导出

Capabilities

skillsource-opensensenovaskill-sn-da-excel-workflowtopic-agenttopic-agent-skillstopic-ai-agentstopic-ai-assistanttopic-data-analysistopic-document-processingtopic-office-automationtopic-presentation-slides

Install

Installnpx skills add OpenSenseNova/SenseNova-Skills

Sourcehttps://github.com/OpenSenseNova/SenseNova-Skills/tree/main/skills/sn-da-excel-workflow

skills.shhttps://skills.sh/OpenSenseNova/SenseNova-Skills/sn-da-excel-workflow

Transportskills-sh

Protocolskill

Quality

0.70/ 1.00

deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 1627 github stars · SKILL.md body (9,366 chars)

Provenance

Indexed fromgithub

Enriched2026-05-18 18:53:04Z · deterministic:skill-github:v1 · v1

First seen2026-05-15

Last seen2026-05-18

Agent access

JSONhttps://clawmart.sh/api/listings/RcWPBA