Spark Execution Internals: Deconstructing Jobs, Stages, and Shuffles
Understanding Spark Execution: A Deep Dive

If you are working with Big Data, writing code that "works" is only half the battle. To truly master Apache Spark, you need to understand how your code is translated into physical execution. Today, let's break down a specific Spark snippet to see how Jobs, Stages, and Tasks are born.

The Scenario

Imagine we have the following PySpark code:

```python
df = spark.read.parquet("sales")

result = (
    df.filter("amount > 100")
      .select("customer_id", "amount")
      .repartition(4)
      .groupBy("customer_id")
      .sum("amount")
)

result.write.mode("overwrite").parquet("output")
```

Our Cluster Constraints:

- Input Data: 12 partitions.
- Cluster Hardware: 4 executors, each capable of running 2 tasks simultaneously (8 task slots in total).

Q1. How many Spark Jobs will be created?

Answer: 1 Job.

In Spark, a Job is triggered by an Action. Transformations (like filter or groupBy) are lazy: they only build up the execution plan and run nothing on the cluster. The sole Action in this snippet is the write at the end, so the entire pipeline executes as a single Job.
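To see this laziness for yourself, here is a minimal sketch, not part of the original scenario: it assumes a local SparkSession and swaps a tiny in-memory DataFrame for the "sales" Parquet data, then builds the same transformation chain. Nothing runs until the final write, and explain() lets you preview where the shuffle boundaries will fall.

```python
from pyspark.sql import SparkSession

# Local session standing in for the real cluster (assumption for this demo).
spark = SparkSession.builder.master("local[*]").appName("job-demo").getOrCreate()

# Tiny stand-in for spark.read.parquet("sales") -- same column names, toy data.
df = spark.createDataFrame(
    [(1, 150.0), (1, 90.0), (2, 300.0), (3, 120.0)],
    ["customer_id", "amount"],
)

# Building the plan: none of these calls launch a Job -- they are lazy.
result = (
    df.filter("amount > 100")
      .select("customer_id", "amount")
      .repartition(4)
      .groupBy("customer_id")
      .sum("amount")
)

# Print the physical plan. Exchange operators mark the shuffles that will
# split the Job into Stages (typically one for repartition(4) and one for
# the hash aggregation, depending on the plan Spark chooses).
result.explain()

# Only now, when the Action runs, does a Job appear in the Spark UI.
result.write.mode("overwrite").parquet("output")

spark.stop()
```

If you open the Spark UI while this runs, you should see exactly one Job listed for the write, matching the answer above; explain() itself costs nothing at runtime because it only prints the plan.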