

Master Jobs, Stages, and Tasks for Data Engineering Interviews

Mastering Spark execution internals is a "must-have" skill for Data Engineers. Whether you are prepping for an interview or debugging a slow production pipeline, understanding how Spark breaks down your code is the key to performance tuning. Spark applications follow a strict hierarchy: Jobs > Stages > Tasks. Let’s break down exactly how this works.

1. High-Level Architecture

Before we dive into the code, let’s look at the components that manage the execution:

Driver: The brain. It converts your code into a Directed Acyclic Graph (DAG) and schedules tasks.
DAG Scheduler: Splits the graph into Stages based on "shuffles."
Task Scheduler: Sends the individual Tasks to the executors.
Executors: The workers that actually run the tasks in parallel.

2. Real-World Code Walkthrough: The "Wide" Transformation

Let’s analyze a common scenario: reading data, filtering, grouping, and saving.

# 1. Read Data (Narrow)
df = sp...
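The Jobs > Stages > Tasks hierarchy can be illustrated with a toy model in plain Python. This is only a sketch, not Spark's actual scheduler: the `Op` class, the `split_into_stages` and `count_tasks` helpers, and the partition counts are all hypothetical, and the model simplifies by starting a new stage at every wide (shuffle) operation and assuming one task per partition in each stage.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    wide: bool  # True if the operation requires a shuffle


def split_into_stages(ops):
    """Group operations into stages; each wide operation opens a new stage."""
    stages = [[]]
    for op in ops:
        if op.wide and stages[-1]:
            stages.append([])
        stages[-1].append(op.name)
    return stages


def count_tasks(stages, input_partitions, shuffle_partitions):
    """Assume one task per partition: input partitions for the first stage,
    shuffle partitions for every stage that follows a shuffle."""
    return [
        input_partitions if i == 0 else shuffle_partitions
        for i in range(len(stages))
    ]


# Hypothetical pipeline mirroring the scenario: read, filter, group, save.
pipeline = [
    Op("read", wide=False),
    Op("filter", wide=False),
    Op("groupBy", wide=True),
    Op("write", wide=False),
]

stages = split_into_stages(pipeline)
# Illustrative numbers only: 4 input partitions, 200 shuffle partitions.
tasks_per_stage = count_tasks(stages, input_partitions=4, shuffle_partitions=200)

for number, (ops, tasks) in enumerate(zip(stages, tasks_per_stage), start=1):
    print(f"Stage {number}: {ops} -> {tasks} tasks")
```

In this model the single job splits into two stages at the `groupBy`, and the number of tasks in each stage follows from the number of partitions it processes.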

NVL vs COALESCE

NVL vs COALESCE to handle NULL values in SQL.

Both NVL and COALESCE are used in SQL to handle null values, but they have some differences:

Syntax: NVL takes two arguments, while COALESCE takes two or more arguments.
Return value: NVL returns the first argument if it is not null, otherwise it returns the second argument. COALESCE returns the first non-null value from its arguments.

Here are some examples to illustrate the differences:

NVL Example:

SELECT NVL(NULL, 'hello') FROM dual;

This will return 'hello', since the first argument is null.

SELECT NVL('world', 'hello') FROM dual;

This will return 'world', since the first argument is not null.

COALESCE Example:

SELECT COALESCE(NULL, NULL, 'hello', 'world') FROM dual;

This will return 'hello', since it is the first non-null value.

SELECT COALESCE(NULL, 'hello', 'world') FROM dual;

This will also return 'hello', since it is the first non-null value...
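The same behaviour can be checked locally without an Oracle instance. The sketch below uses Python's built-in `sqlite3` module, with two assumptions to note: SQLite has no `NVL` or `dual`, so its two-argument `IFNULL` stands in for `NVL` here, and the `scalar` helper is a hypothetical convenience function, not part of any library.

```python
import sqlite3


def scalar(sql):
    """Run a single-value query against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(sql).fetchone()[0]
    finally:
        conn.close()


# IFNULL is used as a stand-in for Oracle's two-argument NVL (assumption).
ifnull_when_null = scalar("SELECT IFNULL(NULL, 'hello')")
ifnull_when_set = scalar("SELECT IFNULL('world', 'hello')")

# COALESCE accepts any number of arguments and returns the first non-null one.
coalesce_four_args = scalar("SELECT COALESCE(NULL, NULL, 'hello', 'world')")
coalesce_three_args = scalar("SELECT COALESCE(NULL, 'hello', 'world')")

print(ifnull_when_null, ifnull_when_set, coalesce_four_args, coalesce_three_args)
```

The two-argument form covers the simple "use a default if null" case, while COALESCE is the one to reach for when there are several fallback values to try in order.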