
NVL vs COALESCE

Using NVL and COALESCE to handle NULL values in SQL.



Both NVL and COALESCE replace NULL values in SQL, but they differ in a few ways:

Syntax: NVL takes exactly two arguments, while COALESCE takes two or more.

Return value: NVL returns the first argument if it is not null; otherwise it returns the second argument. COALESCE returns the first non-null value among its arguments, or NULL if every argument is null.

Portability: NVL originated in Oracle and is not part of the SQL standard, while COALESCE is standard SQL and is available in most databases.
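The rules above can be modeled outside a database. This is only a rough Python sketch, assuming None stands in for SQL NULL; the function names nvl and coalesce are illustrative helpers, not a database API.

```python
from typing import Any


def nvl(value: Any, default: Any) -> Any:
    """Two-argument form: return `value` unless it is None (our stand-in for NULL)."""
    return default if value is None else value


def coalesce(*values: Any) -> Any:
    """Variadic form: return the first non-None argument, or None if all are None."""
    for candidate in values:
        if candidate is not None:
            return candidate
    return None


first = nvl(None, "hello")                         # falls back to the second argument
second = nvl("world", "hello")                     # keeps the first argument
third = coalesce(None, None, "hello", "world")     # first non-None value in the list
fourth = coalesce(None, None)                      # nothing to return, so None
```

Note that this sketch accepts any number of arguments for coalesce, whereas the SQL function expects at least two.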

Here are some examples to illustrate the differences. They are written for Oracle, where dual is the built-in one-row dummy table:

NVL Examples:

SELECT NVL(NULL, 'hello') FROM dual;
This returns 'hello', since the first argument is null.

SELECT NVL('world', 'hello') FROM dual;
This returns 'world', since the first argument is not null.

COALESCE Examples:

SELECT COALESCE(NULL, NULL, 'hello', 'world') FROM dual;
This returns 'hello', since it is the first non-null value.

SELECT COALESCE(NULL, 'hello', 'world') FROM dual;
This also returns 'hello', since it is the first non-null value.
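In practice these functions are usually applied to table columns rather than literals. The sketch below runs a fallback chain against an in-memory SQLite database so it works without an Oracle instance. Several things here are assumptions made for illustration: the contacts table and its rows are hypothetical, SQLite has no NVL so a two-argument stand-in is registered by hand, and FROM dual is dropped because SQLite does not need it.

```python
import sqlite3

# In-memory SQLite database, used only so the queries can run anywhere.
conn = sqlite3.connect(":memory:")

# SQLite has no NVL, so register a two-argument stand-in (an assumption
# made for this sketch; Oracle provides NVL natively).
conn.create_function("NVL", 2, lambda a, b: a if a is not None else b)

# Hypothetical table: a contact may be missing a mobile or office number.
conn.execute("CREATE TABLE contacts (name TEXT, mobile TEXT, office TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Asha", "555-0101", "555-0201"),
        ("Ben", None, "555-0202"),
        ("Chen", None, None),
    ],
)

# COALESCE walks the whole fallback chain in a single call.
coalesce_rows = conn.execute(
    "SELECT name, COALESCE(mobile, office, 'no phone') "
    "FROM contacts ORDER BY name"
).fetchall()

# The same chain with NVL needs nesting, because NVL takes exactly two arguments.
nvl_rows = conn.execute(
    "SELECT name, NVL(mobile, NVL(office, 'no phone')) "
    "FROM contacts ORDER BY name"
).fetchall()

print(coalesce_rows)
print(nvl_rows)
```

Both queries should produce the same rows. The difference is readability: each extra fallback adds another level of nesting to the NVL version, while COALESCE just takes one more argument.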

Hope it helps.

#sql #null #handling
