Posts

Showing posts from April, 2025

Spark Execution Internals: Deconstructing Jobs, Stages, and Shuffles

Understanding Spark Execution: A Deep Dive If you are working with Big Data, writing code that "works" is only half the battle. To truly master Apache Spark, you need to understand how your code is translated into physical execution. Today, let's break down a specific Spark snippet to see how Jobs, Stages, and Tasks are born. The Scenario Imagine we have the following PySpark code: df = spark.read.parquet("sales") result = (     df.filter("amount > 100")     .select("customer_id", "amount")     .repartition(4)     .groupBy("customer_id")     .sum("amount") ) result.write.mode("overwrite").parquet("output") Our Cluster Constraints: Input Data:  12 partitions. Cluster Hardware:  4 executors, each capable of running 2 tasks simultaneously. Q1. How many Spark Jobs will be created? Answer: 1 Job. In Spark, a  Job  is triggered by an  Action . Transformations (like  filter  or  groupBy ) are lazy...

Introduction to Microsoft Fabric - Unified Analytics Platform

Image
What is Microsoft Fabric? Microsoft Fabric is an enterprise-ready, end-to-end analytics platform. It unifies data movement, data processing, ingestion, transformation, real-time event routing, and report building. It supports these capabilities with integrated services like Data Engineering, Data Factory, Data Science, Real-Time Intelligence, Data Warehouse, and Databases. Fabric provides a seamless, user-friendly SaaS experience. It integrates separate components into a cohesive stack. It centralizes data storage with OneLake and embeds AI capabilities, eliminating the need for manual integration. With Fabric, you can efficiently transform raw data into actionable insights. Capabilities of Fabric Microsoft Fabric enhances productivity, data management, and AI integration. Here are some of its key capabilities: Role-specific workloads:  Customized solutions for various roles within an organization, providing each user with the necessary tools. OneLake:  A unified data lake tha...