Master Jobs, Stages, and Tasks for Data Engineering Interviews

Mastering Spark execution internals is a must-have skill for Data Engineers. Whether you are prepping for an interview or debugging a slow production pipeline, understanding how Spark breaks down your code is the key to performance tuning. Spark applications follow a strict hierarchy: Jobs > Stages > Tasks. Let's break down exactly how this works.

1. High-Level Architecture

Before we dive into the code, let's look at the components that manage the execution:

Driver: The brain. It converts your code into a Directed Acyclic Graph (DAG) and schedules tasks.
DAG Scheduler: Splits the graph into Stages at shuffle boundaries.
Task Scheduler: Sends the individual Tasks to the executors.
Executors: The workers that actually run the tasks in parallel.

2. Real-World Code Walkthrough: The "Wide" Transformation

Let's analyze a common scenario: reading data, filtering, grouping, and saving.

# 1. Read Data (Narrow)
df = sp...
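To make the stage-splitting rule concrete, here is a toy sketch (plain Python, not Spark itself) of how a DAG Scheduler cuts a linear lineage into stages: narrow transformations chain together within a stage, and each wide (shuffle) transformation closes the current stage. The operation names and the `split_into_stages` helper are illustrative, not Spark APIs.

```python
# Transformations that force a shuffle (wide dependencies) end a stage.
WIDE_OPS = {"groupBy", "join", "repartition", "distinct", "sortBy"}

def split_into_stages(lineage):
    """Group a linear lineage of transformations into stages,
    starting a new stage after every shuffle boundary."""
    stages, current = [], []
    for op in lineage:
        current.append(op)
        if op in WIDE_OPS:  # shuffle boundary: close this stage
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

lineage = ["read", "filter", "groupBy", "agg", "save"]
print(split_into_stages(lineage))
# [['read', 'filter', 'groupBy'], ['agg', 'save']]
```

In real Spark, each of these stages is then broken into one task per partition, which the Task Scheduler ships to executors.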

NVL vs COALESCE

How to use NVL vs COALESCE to handle NULL values in SQL.



Both NVL and COALESCE are used in SQL to handle null values, but they have some differences:

Syntax: NVL takes exactly two arguments, while COALESCE accepts two or more.

Return value: NVL returns the first argument if it is not null, otherwise it returns the second argument. COALESCE evaluates its arguments left to right and returns the first non-null value; if all arguments are null, it returns NULL.

Portability: NVL is Oracle-specific, while COALESCE is part of the ANSI SQL standard and is supported by most databases.

Here are some examples to illustrate the differences:

NVL Example:


SELECT NVL(NULL, 'hello') FROM dual;
This will return 'hello', since the first argument is null.


SELECT NVL('world', 'hello') FROM dual;
This will return 'world', since the first argument is not null.
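NVL (and the `FROM dual` idiom) is Oracle-specific, so if you want to try the same semantics locally without an Oracle instance, SQLite's two-argument IFNULL behaves the same way. A quick sketch using Python's built-in sqlite3 module:

```python
import sqlite3

# In-memory database; no FROM clause is needed in SQLite.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# SQLite's IFNULL(a, b) mirrors Oracle's NVL(a, b):
# return a if it is not NULL, otherwise b.
print(cur.execute("SELECT IFNULL(NULL, 'hello')").fetchone()[0])    # hello
print(cur.execute("SELECT IFNULL('world', 'hello')").fetchone()[0]) # world
```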

COALESCE Example:

SELECT COALESCE(NULL, NULL, 'hello', 'world') FROM dual;

This will return 'hello', since it is the first non-null value.


SELECT COALESCE(NULL, 'hello', 'world') FROM dual;
This will also return 'hello', since it is the first non-null value.
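Because COALESCE is standard SQL, the same queries run unchanged almost anywhere. Here is the same pair of examples verified in SQLite via Python's sqlite3 module (SQLite needs no FROM dual):

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# COALESCE scans its arguments left to right and returns
# the first non-NULL value.
print(cur.execute("SELECT COALESCE(NULL, NULL, 'hello', 'world')").fetchone()[0])  # hello
print(cur.execute("SELECT COALESCE(NULL, 'hello', 'world')").fetchone()[0])        # hello
```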

Hope it helps.

#sql #null #handling

