How to Configure a Databricks Cluster to Process 10 TB of Data Efficiently

🚀 Sizing a Databricks Cluster for 10 TB: A Step-by-Step Optimization Guide

Processing 10 TB of data in Databricks may sound intimidating, but with a smart cluster sizing strategy, it can be both fast and cost-effective. In this post, we’ll walk through how to determine the right number of partitions, nodes, executors, and memory to optimize Spark performance for large-scale workloads.


📌 Step 1: Estimate the Number of Partitions

To unlock Spark’s parallelism, data must be split into manageable partitions.

  • Data Volume: 10 TB = 10,240 GB
  • Target Partition Size: ~128 MB (0.128 GB)
  • Formula: 10,240 GB / 0.128 GB ≈ 80,000 partitions

💡 Tip: Use file formats like Parquet or Delta Lake to ensure partitions are splittable.
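To see this in practice, here is a minimal PySpark sketch, assuming a Databricks notebook where `spark` is already defined and a hypothetical Delta table path; the two settings keep input splits near 128 MB and size shuffle parallelism to the ~80,000-partition estimate.

```python
# Keep input splits around ~128 MB (this is also Spark's default value).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))

# Match shuffle parallelism to the ~80,000-partition estimate above.
spark.conf.set("spark.sql.shuffle.partitions", "80000")

# Hypothetical source path, used only for illustration.
df = spark.read.format("delta").load("/mnt/datalake/events")

# Check how many partitions Spark actually created for the scan.
print(f"Input partitions: {df.rdd.getNumPartitions()}")
```

Note that Adaptive Query Execution, which is enabled by default on recent Databricks runtimes, may coalesce shuffle partitions at run time, so treat the number as an upper bound rather than a hard target.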


📌 Step 2: Determine Number of Nodes

Assuming each node handles 100–200 partitions effectively:

  • Without overhead: 80,000 partitions / 100–200 partitions per node ≈ 400–800 nodes
  • With a Node Utilization Factor (NUF = 50%–80%), and because each core works through its partitions in multiple waves: adjusted estimate ≈ 100–200 nodes (see the sketch below)

💡 Tip: Start with ~0.5× node utilization and monitor CPU/memory via the Spark UI in Databricks.
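A quick way to sanity-check a node count in this range is to look at how many task "waves" the job would take. The helper below is my own back-of-the-envelope sketch; the 8 cores per node is an illustrative assumption, not a Databricks default.

```python
# Rough sizing helper: how many waves of ~128 MB tasks does the cluster need?
# cores_per_node = 8 is an assumed example; substitute your instance type's core count.

def task_waves(partitions: int = 80_000, nodes: int = 150, cores_per_node: int = 8):
    total_slots = nodes * cores_per_node    # tasks that can run concurrently
    waves = -(-partitions // total_slots)   # ceiling division
    return total_slots, waves

for nodes in (100, 150, 200):
    slots, waves = task_waves(nodes=nodes)
    print(f"{nodes} nodes -> {slots} concurrent tasks -> ~{waves} waves")
```

If the wave count is far too high for your deadline, add nodes; if it drops to single digits, you are probably over-provisioned.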


📌 Step 3: Configure Executors and Cores

  • Target cores per executor: 5–10
  • Total cores: 100–200 nodes × 5–10 cores per node ≈ 500–2,000 cores
  • Executors: total cores / cores per executor ≈ 50–200 executors

💡 Tip: Don’t overcommit CPU. Ensure each executor has enough memory to avoid GC overhead.
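As a concrete but purely illustrative example, a cluster spec in the shape of the Databricks Clusters/Jobs API might look like the snippet below. The runtime version, node type, and worker count are assumptions, and on Databricks the executor layout is often managed by the platform, so treat the executor settings as a sketch rather than required values.

```python
# Illustrative "new_cluster" payload; values are examples, not prescriptions.
new_cluster = {
    "spark_version": "14.3.x-scala2.12",   # example Databricks Runtime
    "node_type_id": "i3.4xlarge",          # hypothetical 16-core worker type
    "num_workers": 150,                    # inside the 100-200 node estimate
    "spark_conf": {
        "spark.executor.cores": "5",       # ~5 cores per executor
        "spark.executor.memory": "16g",
        "spark.sql.shuffle.partitions": "80000",
    },
}
```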


📌 Step 4: Estimate Total Memory

Memory is vital for shuffle-heavy operations and caching.

  • Memory per executor: 10–20 GB
  • Total executors: 50–200
  • Total memory: 50 × 10 GB up to 200 × 20 GB ≈ 500 GB – 4 TB

💡 Tip: Allocate ~60% of executor memory to Spark execution. Don’t forget memory for overhead and JVM.
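The corresponding memory settings would sit in the same spark_conf block; this is a hedged sketch, with Spark's documented defaults noted where they apply.

```python
memory_conf = {
    "spark.executor.memory": "16g",          # within the 10-20 GB per-executor range
    "spark.executor.memoryOverhead": "4g",   # headroom for JVM and off-heap usage
    "spark.memory.fraction": "0.6",          # ~60% of heap for execution + storage (Spark default)
    "spark.memory.storageFraction": "0.5",   # share of that protected for cached data (default)
}
```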


📌 Step 5: Choose the Right Storage Format

  • 🔹 Parquet
    Best for read-heavy workloads. Columnar, compressed, and supports predicate pushdown.
  • 🔹 Delta Lake
    Built on Parquet with ACID, schema evolution, and time travel. Ideal for batch + streaming.
  • 🔹 Apache Hudi
    Best for upsert-heavy or CDC use cases in lakehouses.
  • 🔹 ORC
    Great for Hive-based stacks (e.g., EMR, Athena). Less common in Spark-native stacks.

✅ Recommendation: Use Parquet or Delta Lake in Databricks unless your workload demands Hudi or ORC.
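Continuing the hypothetical DataFrame from Step 1, writing it out as Delta (or plain Parquet) is a one-liner; the output path and partition column are made-up names for illustration.

```python
# Land the data as a Delta table (columnar Parquet files + transaction log).
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("event_date")     # assumes an event_date column exists
   .save("/mnt/datalake/events_delta"))

# Plain Parquet is the same call with a different format:
# df.write.format("parquet").mode("overwrite").save("/mnt/datalake/events_parquet")
```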


✅ Summary Table

Component              Estimate
---------------------  --------------------
Partitions             ~80,000
Nodes                  100–200
Total Executor Cores   500–2,000
Executors              50–200
Total Memory           500 GB – 4 TB
Recommended Format     Parquet / Delta Lake

🔍 Final Thoughts

When it comes to big data, planning is performance. By estimating your cluster needs based on data volume, partition size, executor configuration, and memory, you set yourself up for scalable, efficient processing.

Databricks is powerful—but only if you configure it to match your workload. This guide gives you a practical blueprint to process 10 TB of data without wasting compute or budget.


Have thoughts or questions? Drop them in the comments! 🔽
