Spark Execution Internals: Deconstructing Jobs, Stages, and Shuffles

Understanding Spark Execution: A Deep Dive

If you are working with Big Data, writing code that "works" is only half the battle. To truly master Apache Spark, you need to understand how your code is translated into physical execution. Today, let's break down a specific Spark snippet to see how Jobs, Stages, and Tasks are born.

The Scenario

Imagine we have the following PySpark code:

df = spark.read.parquet("sales")
result = (
    df.filter("amount > 100")
    .select("customer_id", "amount")
    .repartition(4)
    .groupBy("customer_id")
    .sum("amount")
)
result.write.mode("overwrite").parquet("output")

Our Cluster Constraints:

Input Data: 12 partitions.
Cluster Hardware: 4 executors, each capable of running 2 tasks simultaneously.

Q1. How many Spark Jobs will be created?

Answer: 1 Job. In Spark, a Job is triggered by an Action. Transformations (like filter or groupBy) are lazy: they only build up a logical plan and execute nothing on their own. The single action in this pipeline, result.write, is what triggers execution, so exactly one Job is created.
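As a back-of-the-envelope sketch, the cluster constraints above already tell us how much parallelism the first stage gets. The "slot" and "wave" terminology below follows common Spark usage, and the assumption that the first stage runs one task per input partition is standard behavior; the numbers come directly from the scenario.

```python
import math

# Constraints from the scenario above
input_partitions = 12   # the first stage runs one task per input partition
executors = 4
tasks_per_executor = 2

# Total task slots available across the cluster
slots = executors * tasks_per_executor        # 4 * 2 = 8 concurrent tasks

# 12 tasks cannot all run at once on 8 slots, so the stage
# completes in "waves" of up to 8 tasks each
waves = math.ceil(input_partitions / slots)   # ceil(12 / 8) = 2

print(f"{slots} concurrent tasks, {waves} waves for the first stage")
```

So even before looking at stage boundaries, we know the 12-task scan stage needs two waves: one full wave of 8 tasks, then a second, partially filled wave of 4.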

NVL vs COALESCE

How to use NVL and COALESCE to handle NULL values in SQL.



Both NVL and COALESCE are used in SQL to handle NULL values, but they have some important differences (COALESCE is part of the ANSI SQL standard, while NVL is Oracle-specific):

Syntax: NVL takes two arguments, while COALESCE takes two or more arguments.

Return value: NVL returns the first argument if it is not null, otherwise it returns the second argument. COALESCE returns the first non-null value from its arguments.

Here are some examples to illustrate the differences:

NVL Example:


SELECT NVL(NULL, 'hello') FROM dual;
This will return 'hello', since the first argument is null.


SELECT NVL('world', 'hello') FROM dual;
This will return 'world', since the first argument is not null.

COALESCE Example:

SELECT COALESCE(NULL, NULL, 'hello', 'world') FROM dual;

This will return 'hello', since it is the first non-null value.


SELECT COALESCE(NULL, 'hello', 'world') FROM dual;
This will also return 'hello', since it is the first non-null value.
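The return-value rules above can be sketched in plain Python. These are hypothetical helper functions for illustration only, using None to stand in for SQL NULL:

```python
def nvl(value, default):
    # NVL: exactly two arguments; return the first if it is not NULL (None),
    # otherwise return the second
    return value if value is not None else default

def coalesce(*args):
    # COALESCE: two or more arguments; return the first non-NULL one
    # (None if every argument is NULL)
    return next((arg for arg in args if arg is not None), None)

print(nvl(None, "hello"))                      # hello
print(nvl("world", "hello"))                   # world
print(coalesce(None, None, "hello", "world"))  # hello
```

The *args signature mirrors why COALESCE is the more flexible of the two: it accepts any number of fallbacks, while NVL is locked to a single one.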

Hope it helps.

#sql #null #handling
