Databricks Pyspark

If Delta Lake Uses Immutable Files, How Do UPDATE, DELETE, and MERGE Work?


By Raman Gupta - January 19, 2026

One of the most common questions data engineers ask is: if Delta Lake stores data in immutable Parquet files, how can it support operations like UPDATE, DELETE, and MERGE? The answer lies in Delta Lake’s transaction log and its clever file rewrite mechanism.

🔍 Immutable Files in Delta Lake

Delta Lake stores data in Parquet files, which are immutable by design. This immutability ensures consistency and prevents accidental corruption. But immutability doesn’t mean data can’t change; it means changes are handled by creating new versions of files rather than editing them in place.

⚡ How UPDATE Works

When you run an UPDATE statement, Delta Lake:

1. Identifies the files containing rows that match the update condition.
2. Reads those files and applies the update logic.
3. Writes out new Parquet files with the updated rows.
4. Marks the old files as removed in the transaction log.

UPDATE people SET age = age + 1 WHERE country = 'India';

Result: ...
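The four UPDATE steps described above can be illustrated with a toy copy-on-write model in plain Python. This is only a sketch under stated assumptions, not Delta Lake's actual implementation: the `files` dictionary, the `("add" | "remove", name)` log entries, the helper names `live_files` and `update`, and the file naming scheme are all hypothetical stand-ins for Parquet files and the transaction log.

```python
# Toy model of copy-on-write UPDATE; NOT Delta Lake's real implementation.
# "Files" are immutable tuples of rows; the log is a list of add/remove actions.


def live_files(log):
    """Replay the log to find which file names currently belong to the table."""
    live = set()
    for action, name in log:
        if action == "add":
            live.add(name)
        else:  # "remove"
            live.discard(name)
    return live


def update(files, log, predicate, change):
    """Rewrite every live file holding a matching row; never edit in place."""
    for name in sorted(live_files(log)):
        rows = files[name]
        if not any(predicate(r) for r in rows):
            continue  # no matching rows: this file is left untouched
        new_name = f"part-{len(files):04d}.parquet"  # hypothetical naming scheme
        # Write a brand-new file with updated copies of the matching rows.
        files[new_name] = tuple(change(r) if predicate(r) else r for r in rows)
        # Record the swap in the log; the old file stays on disk, unchanged.
        log.append(("remove", name))
        log.append(("add", new_name))


files = {
    "part-0000.parquet": (
        {"name": "Asha", "age": 30, "country": "India"},
        {"name": "Bob", "age": 41, "country": "USA"},
    ),
    "part-0001.parquet": ({"name": "Carla", "age": 25, "country": "Brazil"},),
}
log = [("add", "part-0000.parquet"), ("add", "part-0001.parquet")]

# Toy equivalent of: UPDATE people SET age = age + 1 WHERE country = 'India'
update(
    files,
    log,
    predicate=lambda r: r["country"] == "India",
    change=lambda r: {**r, "age": r["age"] + 1},
)

print(sorted(live_files(log)))
```

In this sketch the original file is never modified. Readers that replay the log simply stop seeing it, which mirrors how a removed file can remain available to older table versions.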

Databricks Pyspark


By Raman Gupta - May 28, 2023

Check this link: Previous Blog

The blog is about:

1. How to find a particular column in a database that has any number of tables.

2. How to calculate the time taken by a code snippet or a notebook in Databricks.
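Both topics can be sketched in a few lines of plain Python. This is a minimal illustration, not the method used in the linked post: `find_column`, `timed`, and the `schemas` mapping are hypothetical names, and the table list is made-up sample data. On Databricks, a similar mapping could presumably be built from `spark.catalog.listTables()` and `spark.catalog.listColumns()`.

```python
import time
from contextlib import contextmanager


def find_column(schemas, column):
    """Return the tables whose column list contains `column` (case-insensitive)."""
    wanted = column.lower()
    return sorted(
        table
        for table, cols in schemas.items()
        if wanted in (c.lower() for c in cols)
    )


@contextmanager
def timed(label, results):
    """Record the wall-clock seconds spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


# Hypothetical table -> columns mapping used only for illustration.
schemas = {
    "sales.orders": ["order_id", "customer_id", "amount"],
    "sales.customers": ["customer_id", "name"],
    "ops.audit": ["event_id", "created_at"],
}

timings = {}
with timed("find_column", timings):
    matches = find_column(schemas, "Customer_ID")

print(matches)
print(f"took {timings['find_column']:.6f} s")
```

The same `timed` block can wrap any snippet or notebook cell body; `time.perf_counter()` is used because it is a monotonic clock intended for measuring short durations.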



How to Configure a Databricks Cluster to Process 10 TB of Data Efficiently

By Raman Gupta - June 08, 2025

🚀 Sizing a Databricks Cluster for 10 TB: A Step-by-Step Optimization Guide

Processing 10 TB of data in Databricks may sound intimidating, but with a smart cluster sizing strategy, it can be both fast and cost-effective. In this post, we’ll walk through how to determine the right number of partitions, nodes, executors, and memory to optimize Spark performance for large-scale workloads.

📌 Step 1: Estimate the Number of Partitions

To unlock Spark’s parallelism, data must be split into manageable partitions.

Data Volume: 10 TB = 10,240 GB
Target Partition Size: ~128 MB (0.125 GB)
Formula: 10,240 / 0.125 = 81,920, or roughly 80,000 partitions

💡 Tip: Use file formats like Parquet or Delta Lake to ensure partitions are splittable.

📌 Step 2: Determine Number of Nodes

Assuming each node handles 100–200 partitions effectively:

Without overhead: 80,000 / (100 to 200) = 400 to 800...
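The partition and node arithmetic from Steps 1 and 2 can be reproduced with a short Python sketch. The function names and default values are hypothetical and only encode the assumptions stated in the excerpt (about 128 MB per partition, 100 to 200 partitions per node, no overhead). With binary units (1 GB = 1,024 MB) the exact partition count is 81,920, which the post rounds to about 80,000.

```python
import math


def estimate_partitions(data_tb, partition_mb=128):
    """Number of ~partition_mb partitions needed for data_tb terabytes."""
    total_mb = data_tb * 1024 * 1024  # TB -> GB -> MB, binary units
    return math.ceil(total_mb / partition_mb)


def estimate_nodes(partitions, per_node_low=100, per_node_high=200):
    """Return (fewest, most) nodes, given how many partitions one node handles."""
    fewest = math.ceil(partitions / per_node_high)
    most = math.ceil(partitions / per_node_low)
    return fewest, most


partitions = estimate_partitions(10)
nodes = estimate_nodes(partitions)

print(f"partitions: {partitions}")
print(f"nodes (before overhead): {nodes[0]} to {nodes[1]}")
```

The unrounded figures come out slightly above the excerpt's 400 to 800 range because the excerpt starts from the rounded 80,000 partitions.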

5 Reasons Your Spark Jobs Are Slow — and How to Fix Them Fast

By Raman Gupta - June 07, 2025

🚀 Why Your Spark Pipelines Are Slow: The 5 Core Bottlenecks (and How to Fix Them)

Apache Spark is renowned for its ability to handle massive datasets with blazing speed and scalability. But if your Spark pipelines are dragging their feet, there’s a good chance they’re falling into one (or more) of the five core performance traps.

This post dives into the five fundamental reasons why Spark jobs become slow, along with practical tips to diagnose and fix each one. Mastering these can make the difference between a sluggish pipeline and one that completes in seconds.

┌──────────────┐
│  Input File  │
└─────┬────────┘
      ▼
┌─────────────┐
...


The Data Engineer’s Journal
