Posts

Showing posts from May, 2023

How to Configure a Databricks Cluster to Process 10 TB of Data Efficiently

🚀 Sizing a Databricks Cluster for 10 TB: A Step-by-Step Optimization Guide

Processing 10 TB of data in Databricks may sound intimidating, but with a smart cluster sizing strategy it can be both fast and cost-effective. In this post, we'll walk through how to determine the right number of partitions, nodes, executors, and memory to optimize Spark performance for large-scale workloads.

📌 Step 1: Estimate the Number of Partitions

To unlock Spark's parallelism, data must be split into manageable partitions.

Data Volume: 10 TB = 10,240 GB
Target Partition Size: ~128 MB (0.128 GB)
Formula: 10,240 GB / 0.128 GB ≈ 80,000 partitions

💡 Tip: Use file formats like Parquet or Delta Lake to ensure partitions are splittable.

📌 Step 2: Determine the Number of Nodes

Assuming each node handles 100–200 partitions effectively:

Without overhead: 80,000 / 100–200 = 400 to 800...
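As a quick illustration of the arithmetic in the two steps above, here is a minimal Python sketch of the sizing math. The constants (128 MB target partition size, 100–200 partitions per node) are the assumptions from the excerpt, not fixed rules, and should be tuned to your own workload; the shuffle-partition setting at the end is one place such an estimate is commonly applied.

```python
# Back-of-the-envelope cluster sizing for a 10 TB workload.
# Assumptions (from the steps above): ~128 MB target partition size
# and 100-200 partitions handled effectively per node.

DATA_VOLUME_GB = 10 * 1024          # 10 TB expressed in GB
TARGET_PARTITION_GB = 0.128         # ~128 MB per partition
PARTITIONS_PER_NODE = (100, 200)    # rough capacity range per node

partitions = DATA_VOLUME_GB / TARGET_PARTITION_GB   # ~80,000 partitions
nodes_max = partitions / PARTITIONS_PER_NODE[0]     # ~800 nodes (conservative)
nodes_min = partitions / PARTITIONS_PER_NODE[1]     # ~400 nodes (optimistic)

print(f"Estimated partitions: {partitions:,.0f}")
print(f"Estimated nodes (before overhead): {nodes_min:,.0f} to {nodes_max:,.0f}")

# If you carry the partition estimate into Spark, the shuffle-partition
# count is one knob it can inform (shown here only as an example):
# spark.conf.set("spark.sql.shuffle.partitions", int(partitions))
```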

Databricks Pyspark

Check this link: Previous Blog. That post covers: 1. How to find a particular column in a database that has n number of tables. 2. How to calculate the time taken by a code snippet or a notebook in Databricks. Here is the link to the previous blog.
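For reference, the sketch below shows one way those two tasks can be approached in a Databricks notebook. It is a hypothetical illustration, not the previous blog's actual code: the database name `my_db` and column `customer_id` are made-up examples, and it uses the PySpark catalog API plus `time.time()` for a simple elapsed-time measurement.

```python
# Hypothetical sketch: find which tables in a database contain a given
# column, then time a small piece of work. Names below are placeholders.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already provided as `spark` in Databricks

database = "my_db"             # assumed database name
target_column = "customer_id"  # assumed column to search for

# 1. Scan every table in the database for the target column.
matches = []
for table in spark.catalog.listTables(database):
    columns = [c.name for c in spark.catalog.listColumns(table.name, database)]
    if target_column in columns:
        matches.append(table.name)
print(f"Tables containing '{target_column}':", matches)

# 2. Time a code snippet (here: counting rows in the first matching table).
start = time.time()
if matches:
    row_count = spark.table(f"{database}.{matches[0]}").count()
    print(f"{matches[0]} has {row_count:,} rows")
print(f"Elapsed: {time.time() - start:.2f} s")
```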