Apache Spark DataFrame ETL Pipeline
Automates PySpark DataFrame transformations including schema inference, partition pruning, and Delta Lake merge operations. Integrates with AWS Glue Data Catalog and Apache Iceberg table formats for lakehouse architectures.
What it does
Automates PySpark DataFrame transformations including schema inference, partition pruning, and Delta Lake merge operations. Integrates with AWS Glue Data Catalog and Apache Iceberg table formats for lakehouse architectures.
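As a rough illustration of how partition pruning and a Delta Lake merge can fit together, the sketch below builds a merge condition that restricts the target table to the partitions present in the incoming batch, then runs an upsert. The function names (`build_merge_condition`, `upsert`), the `t`/`s` aliases, and the single non-null string partition column are hypothetical and not taken from this skill. `upsert` also assumes the `delta-spark` package is installed.

```python
from typing import Sequence


def build_merge_condition(
    keys: Sequence[str], partition_col: str, partitions: Sequence[str]
) -> str:
    """Join target (t) and source (s) on `keys`, limited to the given partitions.

    Listing the partition values explicitly lets Delta skip files in
    partitions the incoming batch does not touch.
    """
    if not keys:
        raise ValueError("at least one key column is required")
    key_match = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    if not partitions:
        return key_match
    # Escape single quotes so the values are safe inside a SQL string literal.
    quoted = ", ".join("'" + p.replace("'", "''") + "'" for p in partitions)
    return f"t.{partition_col} IN ({quoted}) AND {key_match}"


def upsert(spark, source_df, target_path: str, keys, partition_col: str) -> None:
    """Upsert `source_df` into the Delta table at `target_path` (hypothetical helper)."""
    # Imported lazily so build_merge_condition works without Delta installed.
    from delta.tables import DeltaTable

    # Assumption: partition values are non-null and safe to compare as strings.
    rows = source_df.select(partition_col).distinct().collect()
    condition = build_merge_condition(keys, partition_col, [str(r[0]) for r in rows])
    (
        DeltaTable.forPath(spark, target_path)
        .alias("t")
        .merge(source_df.alias("s"), condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```

For example, `build_merge_condition(["id"], "dt", ["2024-01-01"])` yields `t.dt IN ('2024-01-01') AND t.id = s.id`.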
Installation
Requirements and caveats from upstream:
- high-level APIs in Scala, Java, Python, and R (Deprecated), and an optimized engine that
- Interactive Python Shell
- Alternatively, if you prefer Python, you can use the Python shell:
Basic usage or getting-started notes:
- To build Spark and its example programs, run:
- And run the following command, which should also return 1,000,000,000:
- Example Programs
- Source: https://github.com/apache/spark
- Extracted from upstream docs: https://raw.githubusercontent.com/apache/spark/HEAD/README.md
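The command behind the "should also return 1,000,000,000" note did not survive extraction. In the upstream README it appears to be a `spark.range(...).count()` call typed into the PySpark shell. A standalone sketch of the same check follows, assuming `pyspark` and a Java runtime are installed locally; the function name `count_range` and the app name are hypothetical.

```python
def count_range(n: int = 1000 * 1000 * 1000) -> int:
    """Count the rows of spark.range(n) on a local Spark session."""
    # Imported lazily: requires pyspark and a working Java installation.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("range-count-check")  # hypothetical app name
        .getOrCreate()
    )
    try:
        # range(n) produces a single-column DataFrame with ids 0..n-1.
        return spark.range(n).count()
    finally:
        spark.stop()
```

Calling `count_range()` with the default argument mirrors the README check; a smaller `n` makes a quicker smoke test.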
Quality
Deterministic score 0.45 from registry signals: indexed on GitHub topic:agent-skills · 8 GitHub stars · SKILL.md body (935 chars)