Posts

Showing posts from September, 2024

How to Configure a Databricks Cluster to Process 10 TB of Data Efficiently

πŸš€ Sizing a Databricks Cluster for 10 TB: A Step-by-Step Optimization Guide

Processing 10 TB of data in Databricks may sound intimidating, but with a smart cluster sizing strategy it can be both fast and cost-effective. In this post, we'll walk through how to determine the right number of partitions, nodes, executors, and memory to optimize Spark performance for large-scale workloads.

πŸ“Œ Step 1: Estimate the Number of Partitions

To unlock Spark's parallelism, data must be split into manageable partitions.

- Data Volume: 10 TB = 10,240 GB
- Target Partition Size: ~128 MB (0.128 GB)
- Formula: 10,240 / 0.128 = ~80,000 partitions

πŸ’‘ Tip: Use file formats like Parquet or Delta Lake to ensure partitions are splittable.

πŸ“Œ Step 2: Determine Number of Nodes

Assuming each node handles 100–200 partitions effectively:

- Without overhead: 80,000 / 100–200 = 400 to 800...
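To make the arithmetic above concrete, here is a minimal sketch of the same calculation in Python. The variable names are illustrative, and the 100–200 partitions-per-node figure is the assumption stated in the preview, not a fixed Spark rule.

```python
# Back-of-the-envelope cluster sizing for the 10 TB example above.
# All figures are the post's assumptions, not hard limits.

data_volume_gb = 10 * 1024        # 10 TB expressed in GB
target_partition_gb = 0.128       # ~128 MB per partition

# Step 1: estimate the partition count
num_partitions = data_volume_gb / target_partition_gb
print(f"Estimated partitions: {num_partitions:,.0f}")            # ~80,000

# Step 2: estimate the node count, assuming 100-200 partitions per node
nodes_low = num_partitions / 200
nodes_high = num_partitions / 100
print(f"Estimated nodes: {nodes_low:.0f} to {nodes_high:.0f}")   # ~400 to ~800
```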

Git & Git Command

Git is the free and open source distributed version control system that's responsible for everything GitHub-related that happens locally on your computer.

πŸš€ Mastering Git Basics πŸš€

⭐ **ls**: List the contents of the current folder
⭐ **mkdir <folder_name>**: Create a new project folder
⭐ **cd <folder_name>**: Navigate into a folder
⭐ **git init**: Initialize a Git repository
⭐ **touch names.txt**: Create a new file
⭐ **git status**: Show the state of the working directory and staging area
⭐ **git add .**: Stage all changed and untracked files
⭐ **git add file.txt**: Stage a specific file
⭐ **git commit -m "message"**: Commit staged changes with a message
⭐ **vi file.txt**: Edit a file
⭐ **cat names.txt**: Display file content
⭐ **git restore --staged files.txt**: Unstage a file
⭐ **git log**: View commit history
⭐ **rm -rf names.txt**: Delete a file
⭐ **git reset <commit id>**: Reset the current branch to a specific commit
⭐ **git stash**: Temporarily store changes
⭐ **git stash pop**: Apply stored changes
⭐ **git stash clear**: Clear stor...
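As a rough illustration, the sketch below strings a few of these commands into a typical first-commit workflow; the folder and file names are just examples.

```bash
mkdir demo-project && cd demo-project   # create a project folder and enter it
git init                                # initialize an empty Git repository
touch names.txt                         # create a new file
git status                              # names.txt shows up as untracked
git add names.txt                       # stage the file
git commit -m "Add names.txt"           # record the staged change
git log                                 # view the commit history
```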