{"id":"170f7517-4b26-495f-918e-86c171720a10","shortId":"NP9s5Y","kind":"skill","title":"spark-optimization","tagline":"Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.","description":"# Apache Spark Optimization\n\nProduction patterns for optimizing Apache Spark jobs including partitioning strategies, memory management, shuffle optimization, and performance tuning.\n\n## Do not use this skill when\n\n- The task is unrelated to apache spark optimization\n- You need a different domain or tool outside this scope\n\n## Instructions\n\n- Clarify goals, constraints, and required inputs.\n- Apply relevant best practices and validate outcomes.\n- Provide actionable steps and verification.\n- If detailed examples are required, open `resources/implementation-playbook.md`.\n\n## Use this skill when\n\n- Optimizing slow Spark jobs\n- Tuning memory and executor configuration\n- Implementing efficient partitioning strategies\n- Debugging Spark performance issues\n- Scaling Spark pipelines for large datasets\n- Reducing shuffle and data skew\n\n## Core Concepts\n\n### 1. Spark Execution Model\n\n```\nDriver Program\n    ↓\nJob (triggered by action)\n    ↓\nStages (separated by shuffles)\n    ↓\nTasks (one per partition)\n```\n\n### 2. Key Performance Factors\n\n| Factor | Impact | Solution |\n|--------|--------|----------|\n| **Shuffle** | Network I/O, disk I/O | Minimize wide transformations |\n| **Data Skew** | Uneven task duration | Salting, broadcast joins |\n| **Serialization** | CPU overhead | Use Kryo, columnar formats |\n| **Memory** | GC pressure, spills | Tune executor memory |\n| **Partitions** | Parallelism | Right-size partitions |\n\n## Quick Start\n\n```python\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql import functions as F\n\n# Create optimized Spark session\nspark = (SparkSession.builder\n    .appName(\"OptimizedJob\")\n    .config(\"spark.sql.adaptive.enabled\", \"true\")\n    .config(\"spark.sql.adaptive.coalescePartitions.enabled\", \"true\")\n    .config(\"spark.sql.adaptive.skewJoin.enabled\", \"true\")\n    .config(\"spark.serializer\", \"org.apache.spark.serializer.KryoSerializer\")\n    .config(\"spark.sql.shuffle.partitions\", \"200\")\n    .getOrCreate())\n\n# Read with optimized settings\ndf = (spark.read\n    .format(\"parquet\")\n    .option(\"mergeSchema\", \"false\")\n    .load(\"s3://bucket/data/\"))\n\n# Efficient transformations\nresult = (df\n    .filter(F.col(\"date\") >= \"2024-01-01\")\n    .select(\"id\", \"amount\", \"category\")\n    .groupBy(\"category\")\n    .agg(F.sum(\"amount\").alias(\"total\")))\n\nresult.write.mode(\"overwrite\").parquet(\"s3://bucket/output/\")\n```\n\n## Patterns\n\n### Pattern 1: Optimal Partitioning\n\n```python\n# Calculate optimal partition count\ndef calculate_partitions(data_size_gb: float, partition_size_mb: int = 128) -> int:\n    \"\"\"\n    Optimal partition size: 128MB - 256MB\n    Too few: Under-utilization, memory pressure\n    Too many: Task scheduling overhead\n    \"\"\"\n    return max(int(data_size_gb * 1024 / partition_size_mb), 1)\n\n# Repartition for even distribution\ndf_repartitioned = df.repartition(200, \"partition_key\")\n\n# Coalesce to reduce partitions (no shuffle)\ndf_coalesced = df.coalesce(100)\n\n# Partition pruning with predicate pushdown\ndf = (spark.read.parquet(\"s3://bucket/data/\")\n    .filter(F.col(\"date\") == \"2024-01-01\"))  # Spark 
pushes this down\n\n# Write with partitioning for future queries\n(df.write\n    .partitionBy(\"year\", \"month\", \"day\")\n    .mode(\"overwrite\")\n    .parquet(\"s3://bucket/partitioned_output/\"))\n```\n\n### Pattern 2: Join Optimization\n\n```python\nfrom pyspark.sql import functions as F\nfrom pyspark.sql.types import *\n\n# 1. Broadcast Join - Small table joins\n# Best when: one side is below spark.sql.autoBroadcastJoinThreshold (10MB by default)\nsmall_df = spark.read.parquet(\"s3://bucket/small_table/\")  # < 10MB\nlarge_df = spark.read.parquet(\"s3://bucket/large_table/\")  # TBs\n\n# Explicit broadcast hint\nresult = large_df.join(\n    F.broadcast(small_df),\n    on=\"key\",\n    how=\"left\"\n)\n\n# 2. Sort-Merge Join - Default for large tables\n# Requires shuffle, but handles any size\nresult = large_df1.join(large_df2, on=\"key\", how=\"inner\")\n\n# 3. Bucket Join - Pre-sorted, no shuffle at join time\n# Write bucketed tables\n(df.write\n    .bucketBy(200, \"customer_id\")\n    .sortBy(\"customer_id\")\n    .mode(\"overwrite\")\n    .saveAsTable(\"bucketed_orders\"))\n\n# Join bucketed tables (no shuffle!)\norders = spark.table(\"bucketed_orders\")\ncustomers = spark.table(\"bucketed_customers\")  # Same bucket count and bucket column\nresult = orders.join(customers, on=\"customer_id\")\n\n# 4. Skew Join Handling\n# Enable AQE skew join optimization\nspark.conf.set(\"spark.sql.adaptive.skewJoin.enabled\", \"true\")\nspark.conf.set(\"spark.sql.adaptive.skewJoin.skewedPartitionFactor\", \"5\")\nspark.conf.set(\"spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes\", \"256MB\")\n\n# Manual salting for severe skew\ndef salt_join(df_skewed, df_other, key_col, num_salts=10):\n    \"\"\"Add salt to distribute skewed keys\"\"\"\n    # Add a random salt (0..num_salts-1) to the skewed side\n    df_salted = df_skewed.withColumn(\n        \"salt\",\n        (F.rand() * num_salts).cast(\"int\")\n    ).withColumn(\n        \"salted_key\",\n        F.concat(F.col(key_col), F.lit(\"_\"), F.col(\"salt\"))\n    )\n\n    # Replicate the other side once per salt value\n    df_exploded = df_other.crossJoin(\n        spark.range(num_salts).withColumnRenamed(\"id\", \"salt\")\n    ).withColumn(\n        \"salted_key\",\n        F.concat(F.col(key_col), F.lit(\"_\"), F.col(\"salt\"))\n    )\n\n    # Join on the salted key; drop or rename the duplicated key/salt columns downstream\n    return df_salted.join(df_exploded, on=\"salted_key\", how=\"inner\")\n```\n\n### Pattern 3: Caching and Persistence\n\n```python\nfrom pyspark import StorageLevel\n\n# Cache when reusing DataFrame multiple times\ndf = spark.read.parquet(\"s3://bucket/data/\")\ndf_filtered = df.filter(F.col(\"status\") == \"active\")\n\n# Cache in memory (MEMORY_AND_DISK is the default for DataFrames)\ndf_filtered.cache()\n\n# Or with a specific storage level (PySpark exposes e.g. DISK_ONLY, MEMORY_AND_DISK_2;\n# the *_SER levels are Scala/Java-only)\ndf_filtered.persist(StorageLevel.DISK_ONLY)\n\n# Force materialization\ndf_filtered.count()\n\n# Use in multiple actions\nagg1 = df_filtered.groupBy(\"category\").count()\nagg2 = df_filtered.groupBy(\"region\").sum(\"amount\")\n\n# Unpersist when done\ndf_filtered.unpersist()\n\n# Storage levels explained:\n# MEMORY_ONLY - Fast, but may not fit\n# MEMORY_AND_DISK - Spills to disk if needed (recommended)\n# MEMORY_ONLY_SER - Serialized, less memory, more CPU (Scala/Java API)\n# DISK_ONLY - When memory is tight\n# OFF_HEAP - Tungsten off-heap memory\n\n# Checkpoint for complex lineage\nspark.sparkContext.setCheckpointDir(\"s3://bucket/checkpoints/\")\ndf_complex = (df\n    .join(other_df, \"key\")\n    .groupBy(\"category\")\n    .agg(F.sum(\"amount\")))\ndf_complex = df_complex.checkpoint()  # Eager by default: materializes and returns a DataFrame with truncated lineage\n```\n\n
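A quick way to confirm that a `cache()`/`persist()` call actually took effect is the `storageLevel` property on DataFrames. A minimal sketch, reusing the `df_filtered` frame from the pattern above:\n\n```python\n# Inspect the storage level currently assigned to a DataFrame\ndf_filtered.persist(StorageLevel.MEMORY_AND_DISK)\nprint(df_filtered.storageLevel)            # memory-and-disk level while persisted\nprint(df_filtered.storageLevel.useMemory)  # True while a memory-backed level is set\n\ndf_filtered.unpersist()\nprint(df_filtered.storageLevel.useMemory)  # False once the cache has been released\n```\n\n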
### Pattern 4: Memory Tuning\n\n```python\n# Executor memory configuration\n# spark-submit --executor-memory 8g --executor-cores 4\n\n# Memory breakdown (8GB executor heap):\n# - ~300MB reserved memory comes off the top, leaving roughly 7.7GB of usable heap\n# - spark.memory.fraction = 0.6 -> ~4.6GB shared by execution + storage\n#   - spark.memory.storageFraction = 0.5 -> ~2.3GB of that protected for cached data\n#   - The rest of the region is for execution (shuffles, joins, sorts)\n# - The remaining ~40% (~3.1GB) is user memory for data structures and internal metadata\n\nspark = (SparkSession.builder\n    .config(\"spark.executor.memory\", \"8g\")\n    .config(\"spark.executor.memoryOverhead\", \"2g\")  # For non-JVM memory\n    .config(\"spark.memory.fraction\", \"0.6\")\n    .config(\"spark.memory.storageFraction\", \"0.5\")\n    .config(\"spark.sql.shuffle.partitions\", \"200\")\n    # For memory-intensive operations\n    .config(\"spark.sql.autoBroadcastJoinThreshold\", \"50MB\")\n    # Prevent OOM on large shuffles\n    .config(\"spark.sql.files.maxPartitionBytes\", \"128MB\")\n    .getOrCreate())\n\n# Monitor memory usage (uses internal JVM accessors; prefer the Spark UI Executors tab for routine checks)\ndef print_memory_usage(spark):\n    \"\"\"Print max and remaining storage memory per executor\"\"\"\n    sc = spark.sparkContext\n    status = sc._jsc.sc().getExecutorMemoryStatus()\n    for executor in status.keySet().toArray():\n        mem_status = status.get(executor).get()  # (max memory for caching, remaining memory) in bytes\n        total = mem_status._1() / (1024**3)\n        free = mem_status._2() / (1024**3)\n        print(f\"{executor}: {total:.2f}GB total, {free:.2f}GB free\")\n```\n\n### Pattern 5: Shuffle Optimization\n\n```python\n# Reduce shuffle data size\n# With AQE, let Spark coalesce shuffle partitions instead of hand-tuning the count\nspark.conf.set(\"spark.sql.adaptive.enabled\", \"true\")\nspark.conf.set(\"spark.sql.adaptive.coalescePartitions.enabled\", \"true\")\nspark.conf.set(\"spark.shuffle.compress\", \"true\")\nspark.conf.set(\"spark.shuffle.spill.compress\", \"true\")\n\n# Pre-aggregate before shuffle\ndf_optimized = (df\n    # Local aggregation first (combiner)\n    .groupBy(\"key\", \"partition_col\")\n    .agg(F.sum(\"value\").alias(\"partial_sum\"))\n    # Then global aggregation\n    .groupBy(\"key\")\n    .agg(F.sum(\"partial_sum\").alias(\"total\")))\n\n# Prefer operations that shuffle less data\n# BAD: exact distinct shuffles every value of the column\ndistinct_count = df.select(\"category\").distinct().count()\n\n# GOOD: approximate distinct shuffles only compact HyperLogLog sketches\napprox_count = df.select(F.approx_count_distinct(\"category\")).collect()[0][0]\n\n# Use coalesce instead of repartition when reducing partitions\ndf_reduced = df.coalesce(10)  # No shuffle\n\n# Optimize shuffle with compression\nspark.conf.set(\"spark.io.compression.codec\", \"lz4\")  # Fast compression\n```\n\n### Pattern 6: Data Format Optimization\n\n```python\n# Parquet optimizations\n(df.write\n    .option(\"compression\", \"snappy\")  # Fast compression\n    .option(\"parquet.block.size\", 128 * 1024 * 1024)  # 128MB row groups\n    .parquet(\"s3://bucket/output/\"))\n\n# Column pruning - only read needed columns\ndf = (spark.read.parquet(\"s3://bucket/data/\")\n    .select(\"id\", \"amount\", \"date\"))  # Spark only reads these columns\n\n# Predicate pushdown - filter at storage level\ndf = (spark.read.parquet(\"s3://bucket/partitioned/year=2024/\")\n    .filter(F.col(\"status\") == \"active\"))  # Pushed to Parquet reader\n\n# Delta Lake optimizations (Databricks / recent Delta Lake releases)\nspark.conf.set(\"spark.databricks.delta.optimizeWrite.enabled\", \"true\")  # Bin-packing on write\nspark.conf.set(\"spark.databricks.delta.autoCompact.enabled\", \"true\")    # Compact small files\n(df.write\n    .format(\"delta\")\n    .mode(\"overwrite\")\n    .save(\"s3://bucket/delta_table/\"))\n\n# Z-ordering for multi-dimensional queries\nspark.sql(\"\"\"\n    OPTIMIZE delta.`s3://bucket/delta_table/`\n    ZORDER BY (customer_id, date)\n\"\"\")\n```\n\n
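Schema inference on text formats (CSV/JSON) costs an extra pass over the input, so supplying an explicit schema is a cheap win when reading raw files. A minimal sketch; the column layout and path are hypothetical placeholders:\n\n```python\nfrom pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType\n\n# Hypothetical layout - replace with the real columns of your source\nevent_schema = StructType([\n    StructField(\"id\", StringType(), nullable=False),\n    StructField(\"amount\", DoubleType(), nullable=True),\n    StructField(\"date\", DateType(), nullable=True),\n])\n\n# With an explicit schema, Spark skips the inference scan entirely\nevents = (spark.read\n    .schema(event_schema)\n    .option(\"header\", \"true\")\n    .csv(\"s3://bucket/raw_events/\"))\n```\n\n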
### Pattern 7: Monitoring and Debugging\n\n```python\n# Session settings that help while debugging\n# (whole-stage codegen is on by default; Arrow speeds up pandas conversions)\nspark.conf.set(\"spark.sql.codegen.wholeStage\", \"true\")\nspark.conf.set(\"spark.sql.execution.arrow.pyspark.enabled\", \"true\")\n\n# Explain query plan\ndf.explain(mode=\"extended\")\n# Modes: simple, extended, codegen, cost, formatted\n\n# Get physical plan statistics\ndf.explain(mode=\"cost\")\n\n# Monitor task metrics\ndef analyze_stage_metrics(spark):\n    \"\"\"Analyze currently active stages\"\"\"\n    status_tracker = spark.sparkContext.statusTracker()\n\n    for stage_id in status_tracker.getActiveStageIds():\n        stage_info = status_tracker.getStageInfo(stage_id)\n        print(f\"Stage {stage_id}:\")\n        print(f\"  Tasks: {stage_info.numTasks}\")\n        print(f\"  Completed: {stage_info.numCompletedTasks}\")\n        print(f\"  Failed: {stage_info.numFailedTasks}\")\n\n# Identify data skew\ndef check_partition_skew(df):\n    \"\"\"Check for partition skew\"\"\"\n    partition_counts = (df\n        .withColumn(\"partition_id\", F.spark_partition_id())\n        .groupBy(\"partition_id\")\n        .count()\n        .orderBy(F.desc(\"count\")))\n\n    partition_counts.show(20)\n\n    stats = partition_counts.select(\n        F.min(\"count\").alias(\"min\"),\n        F.max(\"count\").alias(\"max\"),\n        F.avg(\"count\").alias(\"avg\"),\n        F.stddev(\"count\").alias(\"stddev\")\n    ).collect()[0]\n\n    skew_ratio = stats[\"max\"] / stats[\"avg\"]\n    print(f\"Skew ratio: {skew_ratio:.2f}x (>2x indicates skew)\")\n```\n\n## Configuration Cheat Sheet\n\n```python\n# Production configuration template\nspark_configs = {\n    # Adaptive Query Execution (AQE)\n    \"spark.sql.adaptive.enabled\": \"true\",\n    \"spark.sql.adaptive.coalescePartitions.enabled\": \"true\",\n    \"spark.sql.adaptive.skewJoin.enabled\": \"true\",\n\n    # Memory\n    \"spark.executor.memory\": \"8g\",\n    \"spark.executor.memoryOverhead\": \"2g\",\n    \"spark.memory.fraction\": \"0.6\",\n    \"spark.memory.storageFraction\": \"0.5\",\n\n    # Parallelism\n    \"spark.sql.shuffle.partitions\": \"200\",\n    \"spark.default.parallelism\": \"200\",\n\n    # Serialization\n    \"spark.serializer\": \"org.apache.spark.serializer.KryoSerializer\",\n    \"spark.sql.execution.arrow.pyspark.enabled\": \"true\",\n\n    # Compression\n    \"spark.io.compression.codec\": \"lz4\",\n    \"spark.shuffle.compress\": \"true\",\n\n    # Broadcast\n    \"spark.sql.autoBroadcastJoinThreshold\": \"50MB\",\n\n    # File handling\n    \"spark.sql.files.maxPartitionBytes\": \"128MB\",\n    \"spark.sql.files.openCostInBytes\": \"4MB\",\n}\n```\n\n## Best Practices\n\n### Do's\n- **Enable AQE** - Adaptive query execution handles many issues\n- **Use Parquet/Delta** - Columnar formats with compression\n- **Broadcast small tables** - Avoid shuffle for small joins\n- **Monitor the Spark UI** - Check for skew, spills, GC\n- **Right-size partitions** - 128MB - 256MB per partition\n\n### Don'ts\n- **Don't collect large data** - Keep data distributed\n- **Don't use UDFs unnecessarily** - Use built-in functions\n- **Don't over-cache** - Memory is limited\n- **Don't ignore data skew** - It dominates job time\n- **Don't use `.count()` for existence checks** - Use `.take(1)` or `.isEmpty()`\n\n## Resources\n\n- [Spark Performance Tuning](https://spark.apache.org/docs/latest/sql-performance-tuning.html)\n- [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html)\n- [Databricks Optimization Guide](https://docs.databricks.com/en/optimizations/index.html)\n\n## 
Limitations\n- Use this skill only when the task clearly matches the scope described above.\n- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.\n- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.","tags":["spark","optimization","antigravity","awesome","skills","sickn33","agent-skills","agentic-skills","ai-agent-skills","ai-agents","ai-coding","ai-workflows"],"capabilities":["skill","source-sickn33","skill-spark-optimization","topic-agent-skills","topic-agentic-skills","topic-ai-agent-skills","topic-ai-agents","topic-ai-coding","topic-ai-workflows","topic-antigravity","topic-antigravity-skills","topic-claude-code","topic-claude-code-skills","topic-codex-cli","topic-codex-skills"],"categories":["antigravity-awesome-skills"],"synonyms":[],"warnings":[],"endpointUrl":"https://skills.sh/sickn33/antigravity-awesome-skills/spark-optimization","protocol":"skill","transport":"skills-sh","auth":{"type":"none","details":{"cli":"npx skills add sickn33/antigravity-awesome-skills","source_repo":"https://github.com/sickn33/antigravity-awesome-skills","install_from":"skills.sh"}},"qualityScore":"0.700","qualityRationale":"deterministic score 0.70 from registry signals: · indexed on github topic:agent-skills · 34515 github stars · SKILL.md body (13,351 chars)","verified":false,"liveness":"unknown","lastLivenessCheck":null,"agentReviews":{"count":0,"score_avg":null,"cost_usd_avg":null,"success_rate":null,"latency_p50_ms":null,"narrative_summary":null,"summary_updated_at":null},"enrichmentModel":"deterministic:skill-github:v1","enrichmentVersion":1,"enrichedAt":"2026-04-22T12:51:48.898Z","embedding":null,"createdAt":"2026-04-18T21:45:15.756Z","updatedAt":"2026-04-22T12:51:48.898Z","lastSeenAt":"2026-04-22T12:51:48.898Z","tsv":"'-01':253,254,355,356 '/bucket/checkpoints':713 '/bucket/data':244,350,621,1019 '/bucket/delta_table':1070,1083 '/bucket/large_table':413 '/bucket/output':270,1009 '/bucket/partitioned/year':1038 '/bucket/partitioned_output':376 '/bucket/small_table':407 '/docs/latest/configuration.html)':1385 '/docs/latest/sql-performance-tuning.html)':1380 '/en/optimizations/index.html)':1391 '0':960,961,1214 '0.5':762,807,1259 '0.6':754,804,1257 '1':133,273,321,391,1371 '10':533,973 '100':341 '1024':317,858,863,1002,1003 '10mb':401,408 '128':292,1001 '128mb':297,826,1004,1281,1322 '2':151,378,427 '2.4':767,772 '20':1194 '200':229,329,466,810,1262,1264 '2024':252,354,1039 '256mb':298,516,1323 '2f':869,873,1227 '2g':796,1255 '2x':1229 '3':450,603,859,864 '3.2':780 '4':499,731,748 '4.8':756,765 '40':779 '4mb':1283 '5':513,877 '50':763 '50mb':818,1277 '6':986 '60':755 '7':1090 '8g':744,793,1253 '8gb':751 'action':88,142,653 'activ':627,1043 'adapt':1241,1290 'add':534,540 'agg':261,723,912,923 'agg1':654 'agg2':658 'aggreg':898,905,920 'alia':264,915,927,1199,1203,1207,1211 'amount':257,263,662,725,1022 'analyz':1127,1131 'apach':5,29,36,60 'appli':80 'appnam':213 'approx':952 'approxim':948 'aqe':504,889,1244,1289 'ask':1425 'auto':887 'autocompact':1061 'avg':1208,1220 'avoid':929,1305 'bad':936 'best':82,397,1284 'bin':1058 'bin-pack':1057 'boundari':1433 'break':727 'breakdown':750 'broadcast':172,392,416,1275,1302 'bucket':451,462,475,478,484,488,491 'bucketbi':465 'built':1343 'built-in':1342 'cach':10,604,612,628,770,1350 'calcul':277,282 'cast':552 'categori':258,260,656,722,944,958 'cheat':1233 'check':1169,1173,1313 'checkpoint':707 'clarif':1427 'clarifi':74 'clear':1400 
'coalesc':332,339,963 'codegen':1113 'col':530,560,585,911 'collect':959,1213,1330 'column':1010,1015,1028 'columnar':179,1298 'combin':907 'compact':1063 'complet':1159 'complex':709,715 'compress':979,984,995,998,1270,1301 'concept':132 'config':215,218,221,224,227,791,794,802,805,808,816,824,1240 'configur':111,402,737,1232,1237,1382 'constraint':76 'core':131,747 'cost':1114,1122 'count':280,492,657,942,946,953,956,1178,1189,1192,1198,1202,1206,1210,1366 'cpu':175,693 'creat':207 'criteria':1436 'current':837 'custom':467,470,486,489,495,497,1086 'data':26,129,166,284,314,784,883,987,1166,1332,1334,1357 'databrick':1386 'datafram':615 'dataset':125 'date':251,353,1023,1088 'day':371 'debug':21,116,1093 'def':281,522,831,1126,1168 'default':432,635 'delta':1048,1053,1081 'describ':1404 'detail':93,1096 'df':235,248,326,338,347,404,410,422,525,527,545,570,595,618,622,714,716,719,901,903,970,1016,1035,1172,1179 'df.coalesce':340,972 'df.explain':1107,1120 'df.filter':624 'df.repartition':328 'df.select':943,954 'df.write':367,464,993,1051 'df2':445 'df_complex.checkpoint':726 'df_filtered.cache':636 'df_filtered.count':649 'df_filtered.groupby':655,659 'df_filtered.persist':642 'df_filtered.unpersist':666 'df_other.crossjoin':572 'df_salted.join':594 'df_skewed.withcolumn':547 'differ':66 'dimension':1077 'disk':161,633,645,679,682,694 'distinct':940,941,945,949,957 'distribut':325,537,1335 'docs.databricks.com':1390 'docs.databricks.com/en/optimizations/index.html)':1389 'domain':67 'domin':1360 'done':665 'driver':137 'durat':170 'effici':113,245 'enabl':503,1095,1288 'environ':1416 'environment-specif':1415 'even':324 'exampl':94 'execut':135,759,775,1243,1292 'executor':110,186,735,742,746,752,843,854,867 'executor-cor':745 'executor-memori':741 'exist':1368 'expert':1421 'explain':669,1104 'explicit':415 'explod':564,571,596 'extend':1109,1112 'f':206,387,866,1149,1154,1158,1162,1222 'f.approx':955 'f.avg':1205 'f.broadcast':420 'f.col':250,352,558,562,583,587,625,1041 'f.concat':557,582 'f.desc':1191 'f.lit':561,586 'f.max':1201 'f.min':1197 'f.rand':549 'f.spark':1183 'f.stddev':1209 'f.sum':262,724,913,924 'factor':154,155 'fail':1163 'fals':241 'fast':672,983,997 'file':1065,1278 'filter':249,351,623,1031,1040 'first':906 'fit':676 'float':287 'forc':647 'format':180,237,988,1052,1115,1299 'free':860,872,875 'function':204,385,1345 'futur':365 'gb':286,316,757,766,768,773,781,870,874 'gc':182,1317 'get':853,1116 'getexecutormemorystatus':846,852 'getorcr':230,827 'global':919 'goal':75 'good':947 'group':1006 'groupbi':259,721,908,921,1186 'guid':1388 'handl':439,502,1279,1293 'heap':701,705 'hint':417 'i/o':160,162 'id':256,468,471,498,577,1021,1087,1140,1147,1152,1182,1185,1188 'identifi':1165 'ignor':1356 'impact':156 'implement':112 'import':199,203,384,390,610 'improv':18 'includ':39 'indic':1230 'info':1144 'inner':449,601 'input':79,1430 'instead':964 'instruct':73 'int':291,293,313,553 'intens':814 'intern':787 'isempti':1373 'issu':119,1295 'job':7,23,38,106,139,1361 'join':173,379,393,396,431,452,459,477,501,506,524,589,717,777,1309 'jvm':800 'keep':1333 'key':152,331,424,447,529,539,556,559,581,584,592,599,720,909,922 'keyset':847 'kryo':178 'lake':1049 'larg':124,409,434,444,822,1331 'large_df.join':419 'large_df1.join':443 'left':426 'less':690 'level':641,668,1034 'limit':1353,1392 'lineag':710,728 'load':242 'local':904 'lz4':982,1272 'manag':43 'mani':307,1294 'manual':517 'map':933 'map-sid':932 'match':1401 'materi':648,729 'max':312,1204,1218 'may':674 
'mb':290,320 'mem':849,856,861 'memori':14,42,108,181,187,304,630,631,670,677,686,691,697,706,732,736,743,749,801,813,829,833,838,1251,1351 'memory-intens':812 'merg':430 'mergeschema':240 'metadata':788 'metric':1097,1125,1129,1134 'min':1200 'minim':163 'miss':1438 'mode':372,472,1066,1108,1110,1121 'model':136 'monitor':828,1091,1123,1310 'month':370 'multi':1076 'multi-dimension':1075 'multipl':616,652 'need':64,684,1014 'network':159 'non':799 'non-jvm':798 'num':531,550,574 'off-heap':703 'one':148,399 'oom':820 'open':97 'oper':815,935 'optim':3,4,12,31,35,45,62,103,208,233,274,278,294,380,507,879,902,976,989,992,1050,1080,1387 'optimizedjob':214 'optimizewrit':1055 'option':239,994,999,1054,1060 'order':476,482,485,1073 'orderbi':1190 'orders.join':494 'org.apache.spark.serializer.kryoserializer':226,1267 'outcom':86 'output':1410 'outsid':70 'over-cach':1348 'overhead':176,310 'overwrit':267,373,473,1067 'pack':1059 'parallel':189,1260 'parquet':238,268,374,991,1007,1046 'parquet.block.size':1000 'parquet/delta':1297 'partial':916,925 'partit':9,40,114,150,188,193,275,279,283,288,295,318,330,335,342,363,910,969,1170,1175,1177,1181,1184,1187,1321,1325 'partition_counts.select':1196 'partition_counts.show':1193 'partitionbi':368 'pattern':33,271,272,377,602,730,876,985,1089 'per':149,1324 'perform':20,47,118,153,1376 'permiss':1431 'persist':606 'physic':1117 'pipelin':28,122 'plan':1106,1118 'practic':83,1285 'pre':454,897 'pre-aggreg':896 'pre-sort':453 'predic':345,1029 'pressur':183,305 'prevent':819 'print':832,836,865,1148,1153,1157,1161,1221 'process':27 'product':32,1236 'program':138 'provid':87 'prune':343,1011 'push':358,1044 'pushdown':346,1030 'pyspark':609 'pyspark.sql':198,202,383 'pyspark.sql.types':389 'python':196,276,381,607,734,880,990,1094,1235 'queri':366,1078,1105,1242,1291 'quick':194 'ratio':1216,1224,1226 'read':231,1013,1026 'reader':1047 'recent':1132 'recommend':685 'reduc':126,334,881,968,971 'region':660 'relev':81 'remain':771 'repartit':322,327,966 'requir':78,96,436,1429 'resourc':1374 'resources/implementation-playbook.md':98 'result':247,418,442,493 'result.write.mode':266 'return':311,593 'reus':614 'review':1422 'right':191,1319 'right-siz':190,1318 'row':1005 's3':243,269,349,375,406,412,620,712,1008,1018,1037,1069,1082 'safeti':1432 'salt':171,518,523,532,535,541,546,548,551,555,563,569,575,578,580,588,591,598 'save':1068 'saveast':474 'sc':840 'sc._jsc.sc':845,851 'scale':25,120 'schedul':309 'scope':72,1403 'select':255,1020 'separ':144 'ser':646,688 'serial':174,689,1265 'session':210 'set':234 'sever':520 'sheet':1234 'shuffl':11,44,127,146,158,337,437,457,481,776,823,878,882,900,930,937,951,975,977,1306 'side':400,544,566,934 'simpl':1111 'size':192,285,289,296,315,319,441,884,1320 'skew':130,167,500,505,521,526,538,543,1167,1171,1176,1215,1223,1225,1231,1315,1358 'skill':53,101,1395 'skill-spark-optimization' 'slow':22,104 'small':394,403,421,1064,1303,1308 'snappi':996 'solut':157 'sort':429,455,778 'sort-merg':428 'sortbi':469 'source-sickn33' 'spark':2,6,19,30,37,61,105,117,121,134,209,211,357,739,789,835,1024,1130,1239,1311,1375,1381 'spark-optim':1 'spark-submit':738 'spark.apache.org':1379,1384 'spark.apache.org/docs/latest/configuration.html)':1383 'spark.apache.org/docs/latest/sql-performance-tuning.html)':1378 'spark.conf.set':508,511,514,885,890,893,980,1098,1101 'spark.default.parallelism':1263 'spark.executor.memory':792,1252 'spark.executor.memoryoverhead':795,1254 'spark.io.compression.codec':981,1271 
'spark.memory.fraction':753,803,1256 'spark.memory.storagefraction':761,806,1258 'spark.range':573 'spark.read':236 'spark.read.parquet':348,405,411,619,1017,1036 'spark.serializer':225,1266 'spark.shuffle.compress':891,1273 'spark.shuffle.spill.compress':894 'spark.sparkcontext':841 'spark.sparkcontext.setcheckpointdir':711 'spark.sparkcontext.statustracker':1137 'spark.sql':1079 'spark.sql.adaptive.coalescepartitions.enabled':219,1247 'spark.sql.adaptive.enabled':216,1245 'spark.sql.adaptive.skewjoin.enabled':222,509,1249 'spark.sql.adaptive.skewjoin.skewedpartitionfactor':512 'spark.sql.adaptive.skewjoin.skewedpartitionthresholdinbytes':515 'spark.sql.autobroadcastjointhreshold':817,1276 'spark.sql.codegen.wholestage':1099 'spark.sql.execution.arrow.pyspark.enabled':1102,1268 'spark.sql.files.maxpartitionbytes':825,1280 'spark.sql.files.opencostinbytes':1282 'spark.sql.shuffle.partitions':228,809,886,1261 'spark.table':483,487 'sparksess':200 'sparksession.builder':212,790 'specif':639,1417 'spill':184,680,1316 'stage':143,1128,1133,1139,1143,1146,1150,1151 'stage_info.numcompletedtasks':1160 'stage_info.numfailedtasks':1164 'stage_info.numtasks':1156 'start':195 'stat':1195,1217,1219 'statist':1119 'status':626,850,1042,1135 'status._1':857 'status._2':862 'status_tracker.getactivestageids':1142 'status_tracker.getstageinfo':1145 'stddev':1212 'step':89 'stop':1423 'storag':640,667,760,1033 'storagelevel':611 'storagelevel.memory':643 'strategi':41,115 'structur':785 'submit':740 'substitut':1413 'success':1435 'sum':661,917,926 'tabl':395,435,463,479,1304 'take':1370 'task':56,147,169,308,1124,1155,1399 'tbs':414 'templat':1238 'test':1419 'tight':699 'time':460,617,1362 'toarray':848 'tool':69 'topic-agent-skills' 'topic-agentic-skills' 'topic-ai-agent-skills' 'topic-ai-agents' 'topic-ai-coding' 'topic-ai-workflows' 'topic-antigravity' 'topic-antigravity-skills' 'topic-claude-code' 'topic-claude-code-skills' 'topic-codex-cli' 'topic-codex-skills' 'total':265,855,868,871,928 'tracker':1136 'transform':165,246 'treat':1408 'trigger':140 'true':217,220,223,510,892,895,1056,1062,1100,1103,1246,1248,1250,1269,1274 'ts':1327 'tune':15,48,107,185,733,1377 'tungsten':702 'udf':1339 'ui':1312 'under-util':301 'uneven':168 'unnecessarili':1340 'unpersist':663 'unrel':58 'usag':830,834,839 'use':16,51,99,177,650,962,1296,1338,1341,1365,1369,1393 'user':783 'util':303 'valid':85,1418 'valu':914 'verif':91 'wide':164 'withcolumn':554,579,1180 'withcolumnrenam':576 'write':361,461 'x':1228 'year':369 'z':1072 'z-order':1071 
'zorder':1084","prices":[{"id":"e5bf2b03-2f79-426a-a967-3a45b5b93f4d","listingId":"170f7517-4b26-495f-918e-86c171720a10","amountUsd":"0","unit":"free","nativeCurrency":null,"nativeAmount":null,"chain":null,"payTo":null,"paymentMethod":"skill-free","isPrimary":true,"details":{"org":"sickn33","category":"antigravity-awesome-skills","install_from":"skills.sh"},"createdAt":"2026-04-18T21:45:15.756Z"}],"sources":[{"listingId":"170f7517-4b26-495f-918e-86c171720a10","source":"github","sourceId":"sickn33/antigravity-awesome-skills/spark-optimization","sourceUrl":"https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/spark-optimization","isPrimary":false,"firstSeenAt":"2026-04-18T21:45:15.756Z","lastSeenAt":"2026-04-22T12:51:48.898Z"}],"details":{"listingId":"170f7517-4b26-495f-918e-86c171720a10","quickStartSnippet":null,"exampleRequest":null,"exampleResponse":null,"schema":null,"openapiUrl":null,"agentsTxtUrl":null,"citations":[],"useCases":[],"bestFor":[],"notFor":[],"kindDetails":{"org":"sickn33","slug":"spark-optimization","github":{"repo":"sickn33/antigravity-awesome-skills","stars":34515,"topics":["agent-skills","agentic-skills","ai-agent-skills","ai-agents","ai-coding","ai-workflows","antigravity","antigravity-skills","claude-code","claude-code-skills","codex-cli","codex-skills","cursor","cursor-skills","developer-tools","gemini-cli","gemini-skills","kiro","mcp","skill-library"],"license":"mit","html_url":"https://github.com/sickn33/antigravity-awesome-skills","pushed_at":"2026-04-22T06:40:00Z","description":"Installable GitHub library of 1,400+ agentic skills for Claude Code, Cursor, Codex CLI, Gemini CLI, Antigravity, and more. Includes installer CLI, bundles, workflows, and official/community skill collections.","skill_md_sha":"be026777d229a6c10ca3874def15363b9068ae0c","skill_md_path":"skills/spark-optimization/SKILL.md","default_branch":"main","skill_tree_url":"https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/spark-optimization"},"layout":"multi","source":"github","category":"antigravity-awesome-skills","frontmatter":{"name":"spark-optimization","description":"Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines."},"skills_sh_url":"https://skills.sh/sickn33/antigravity-awesome-skills/spark-optimization"},"updatedAt":"2026-04-22T12:51:48.898Z"}}