Posts

Showing posts from September, 2024

Spark Execution Internals: Deconstructing Jobs, Stages, and Shuffles

Understanding Spark Execution: A Deep Dive

If you are working with Big Data, writing code that "works" is only half the battle. To truly master Apache Spark, you need to understand how your code is translated into physical execution. Today, let's break down a specific Spark snippet to see how Jobs, Stages, and Tasks are born.

The Scenario

Imagine we have the following PySpark code:

```python
df = spark.read.parquet("sales")
result = (
    df.filter("amount > 100")
    .select("customer_id", "amount")
    .repartition(4)
    .groupBy("customer_id")
    .sum("amount")
)
result.write.mode("overwrite").parquet("output")
```

Our cluster constraints:

⭐ Input data: 12 partitions.
⭐ Cluster hardware: 4 executors, each capable of running 2 tasks simultaneously.

Q1. How many Spark Jobs will be created?

Answer: 1 Job. In Spark, a Job is triggered by an Action. Transformations (like `filter` or `groupBy`) are lazy...
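The scheduling arithmetic implied by the constraints above can be sketched without a cluster. This is a back-of-the-envelope estimate, assuming the first (narrow) stage runs one task per input partition and the stage after `repartition(4)` runs one task per shuffle partition; the variable names are illustrative, not Spark APIs.

```python
import math

# Constraints from the scenario
input_partitions = 12       # partitions of the "sales" input
executors = 4
tasks_per_executor = 2
task_slots = executors * tasks_per_executor   # 8 tasks can run at once

# Stage reading + filtering the input: one task per input partition
stage1_tasks = input_partitions
stage1_waves = math.ceil(stage1_tasks / task_slots)   # 12 tasks over 8 slots

# Stage after repartition(4): one task per shuffle partition
stage2_tasks = 4
stage2_waves = math.ceil(stage2_tasks / task_slots)   # 4 tasks fit in one wave

print(task_slots, stage1_waves, stage2_waves)   # 8 2 1
```

So the 12-task stage cannot run fully in parallel on 8 slots: it needs two waves, with the second wave only 50% utilized. (The `groupBy` introduces a further shuffle whose partition count depends on `spark.sql.shuffle.partitions`, which the truncated post presumably covers.)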

Git & Git Command

Git is the free and open source distributed version control system that's responsible for everything GitHub related that happens locally on your computer.

🚀 Mastering Git Basics 🚀

⭐ **ls**: List the contents of the current folder
⭐ **mkdir <folder_name>**: Create a project folder
⭐ **cd <folder_name>**: Navigate into a folder
⭐ **git init**: Initialize a Git repository
⭐ **touch names.txt**: Create a new file
⭐ **git status**: Show changes in the working directory
⭐ **git add .**: Stage all changed and untracked files
⭐ **git add file.txt**: Stage a specific file
⭐ **git commit -m "message"**: Commit staged changes with a message
⭐ **vi file.txt**: Edit a file
⭐ **cat names.txt**: Display a file's content
⭐ **git restore --staged files.txt**: Unstage a file
⭐ **git log**: View commit history
⭐ **rm -rf names.txt**: Delete a file
⭐ **git reset <commit id>**: Move the current branch back to a specific commit
⭐ **git stash**: Temporarily store uncommitted changes
⭐ **git stash pop**: Apply stored changes
⭐ **git stash clear**: Clear stor...
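The commands above chain together into one everyday workflow. Here is a minimal sketch run in a throwaway directory; the file name, commit message, and demo identity are illustrative.

```shell
set -e
repo=$(mktemp -d)          # throwaway directory so nothing real is touched
cd "$repo"

git init -q                # initialize a Git repository
echo "alice" > names.txt   # create a new file
git add names.txt          # stage the specific file
git -c user.name=Demo -c user.email=demo@example.com \
    commit -q -m "add names.txt"   # commit with a message
git log --oneline          # view the commit history
```

Running `git status` between `git add` and `git commit` shows the file under "Changes to be committed", which is exactly what `git restore --staged names.txt` would undo.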