sn-da-image-caption
图片理解与数据提取 skill。当图片文件(.png/.jpg/.jpeg/.gif/.webp/.bmp)是主要输入且用户需要理解、提取数据或分析图片内容时使用。提供预配置的 caption 脚本(scripts/caption.py),通过 vision 模型将图片转为文本描述,无需额外配置 API Key。覆盖:(1) 通过 scripts/caption.py 对图表/表格/截图/流程图进行 caption,(2) 将 caption 文本解析为结构化 DataFrame,(3) 基于提取数据重新生成可视化图表,(4) 导出为 Excel/CSV
What it does
Image Caption Analysis — 图片描述与数据提取
Overview
Analyze, extract data from, or understand image files (.png, .jpg, .jpeg, .gif, .webp, .bmp). The core workflow:
- Run
scripts/caption.pyto get a text description of the image - Parse the description into structured data (DataFrame, etc.)
- Analyze, visualize, or export
scripts/caption.py — Image Caption
The script converts images to text descriptions via a vision model. Configure via SN_API_KEY (minimum required), or use SN_VISION_API_KEY / SN_VISION_BASE_URL / SN_VISION_MODEL for fine-grained control. See the project environment variable spec for the full fallback chain.
Usage
# Basic — get text description
python3 scripts/caption.py /mnt/data/image.png
# Custom prompt — guide what to extract
python3 scripts/caption.py /mnt/data/chart.png --prompt "提取所有数值,Markdown 表格格式"
# JSON output — includes detected type, usage stats, cache info
python3 scripts/caption.py /mnt/data/image.png --json
# Batch — process all images in a directory
python3 scripts/caption.py /mnt/data/images/ --batch --output /mnt/data/captions.json
# Override model (optional)
python3 scripts/caption.py /mnt/data/image.png --model gemini-3.1-flash-lite-preview
Options
| Option | Description |
|---|---|
--prompt, -p | Custom prompt (overrides auto-detection) |
--model, -m | Vision model (default: sensenova-6.7-flash-lite) |
--json | Output structured JSON instead of plain text |
--batch | Process all images in a directory |
--output, -o | Output file for batch results |
--no-cache | Skip MD5 cache |
What it does automatically
- Type detection: Detects image type from filename (chart/table/UI/diagram/general) and picks the best prompt
- Compression: Images >5MB or >2048px are compressed before sending
- Caching: Same image + same prompt → instant cached result, no API cost
- Error handling: Retries on failure, returns error message on permanent failure
JSON output format
{
"file": "/mnt/data/image.png",
"type": "chart",
"description": "这是一张柱状图...",
"usage": {"prompt_tokens": 1100, "completion_tokens": 400},
"cached": false
}
Calling from Python
import subprocess, json
CAPTION = "/path/to/skills/sn-da-image-caption/scripts/caption.py"
# Single image
result = subprocess.run(
["python3", CAPTION, "/mnt/data/chart.png", "--json",
"--prompt", "提取图表数据,Markdown 表格输出"],
capture_output=True, text=True, timeout=60
)
data = json.loads(result.stdout)
description = data["description"]
# Batch
result = subprocess.run(
["python3", CAPTION, "/mnt/data/images/", "--batch",
"--output", "/mnt/data/captions.json"],
capture_output=True, text=True, timeout=300
)
with open("/mnt/data/captions.json") as f:
all_captions = json.load(f)
Prompt Strategy
Different image types need different prompts. The script auto-detects, but specifying --prompt gives better results.
| Image Type | When | Recommended --prompt |
|---|---|---|
| Data chart | 柱状图/折线图/饼图 | "提取图表标题、坐标轴、每个数据点数值、图例。Markdown 表格输出。" |
| Table screenshot | 表格截图 | "提取表格所有内容,Markdown 表格格式,保持行列结构,数值不四舍五入。" |
| UI screenshot | 界面截图 | "以前端开发者视角描述:布局、组件、文字、颜色。" |
| Diagram | 流程图/架构图 | "描述所有节点、连接关系(A→B)、分支条件。" |
| General | 照片、其他 | 不传 --prompt,用默认 |
Parsing Caption Results
Caption 通常返回 Markdown 表格,解析为 DataFrame:
import pandas as pd
def parse_markdown_table(text):
lines = text.strip().split('\n')
table_lines = []
in_table = False
for line in lines:
stripped = line.strip()
if '|' in stripped:
in_table = True
table_lines.append(stripped)
elif in_table:
break
data_lines = []
for l in table_lines:
cells = [c.strip() for c in l.split('|') if c.strip()]
if cells and not all(set(c) <= set('-: ') for c in cells):
data_lines.append(cells)
if len(data_lines) < 2:
return None
header = data_lines[0]
rows = [r for r in data_lines[1:] if len(r) == len(header)]
df = pd.DataFrame(rows, columns=header)
# Auto numeric conversion
for col in df.columns:
try:
cleaned = df[col].str.replace(',', '').str.strip()
if cleaned.str.endswith('%').any():
df[col] = pd.to_numeric(cleaned.str.rstrip('%'), errors='coerce')
else:
converted = pd.to_numeric(cleaned, errors='coerce')
if converted.notna().sum() > len(df) * 0.5:
df[col] = converted
except Exception:
pass
return df
Visualization
Chinese Font Setup (MANDATORY)
import matplotlib.pyplot as plt
import matplotlib
import os
font_path = '/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc'
if os.path.exists(font_path):
matplotlib.rcParams['font.family'] = 'WenQuanYi Zen Hei'
matplotlib.rcParams['axes.unicode_minus'] = False
Color Palette
COLORS = ['#4C72B0', '#55A868', '#C44E52', '#8172B2', '#CCB974', '#64B5CD']
Save & Display
plt.savefig('/mnt/data/chart.png', dpi=150, bbox_inches='tight')
plt.show()
print("")
Export to Excel
from openpyxl.styles import Font, PatternFill, Alignment
output_path = "/mnt/data/result.xlsx"
with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
df.to_excel(writer, index=False, sheet_name='提取数据')
ws = writer.sheets['提取数据']
fill = PatternFill(start_color='4472C4', end_color='4472C4', fill_type='solid')
for cell in ws[1]:
cell.font = Font(bold=True, color='FFFFFF')
cell.fill = fill
cell.alignment = Alignment(horizontal='center')
for i, col in enumerate(df.columns, 1):
w = max(df[col].astype(str).str.len().max(), len(str(col))) + 2
ws.column_dimensions[chr(64 + i)].width = min(w * 1.2, 40)
print(f"[下载](sandbox:{output_path})")
Multi-Image Processing
import glob
image_files = sorted(glob.glob("/mnt/data/*.png"))
all_dfs = []
for img in image_files:
r = subprocess.run(
["python3", CAPTION, img, "--json", "--prompt", "提取数据,Markdown 表格"],
capture_output=True, text=True, timeout=60
)
desc = json.loads(r.stdout)["description"]
df = parse_markdown_table(desc)
if df is not None:
all_dfs.append(df)
combined = pd.concat(all_dfs, ignore_index=True) if all_dfs else None
Or batch mode:
python3 scripts/caption.py /mnt/data/images/ --batch --output /mnt/data/captions.json
Common Pitfalls
- Always caption first — don't guess image content from filenames
- Use --prompt for precision — auto-detect is OK, explicit prompt is better
- Verify extracted data — check sums, percentages, row counts after parsing
- Large tables truncate — caption in two passes:
"提取前半部分"+"提取后半部分" - Chinese font — must set before any matplotlib call, or output is garbled
- Timeout — single image ~10-30s, batch set timeout accordingly
Capabilities
Install
Quality
deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 1627 github stars · SKILL.md body (7,125 chars)