
Amazon Redshift


Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.

Amazon Redshift is a cloud-managed, column-oriented, MPP (massively parallel processing) data warehouse.

Columnar storage advantages:
  • reduces the overall disk I/O requirements
  • reduces the amount of data you need to load from disk
  • each block holds the same data type
  • block data can use a compression scheme selected specifically for the column data type

Example:
Row-based storage: one block contains whole records, e.g. (name, age, address) = (ram, 26, hyb).
Column-based storage: one block stores values from a single column only (all names together, all ages together, and so on).
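Because each block holds a single column's data, Redshift can choose a compression encoding per column. A minimal sketch of per-column encodings (the table name and encoding choices are illustrative, not a recommendation):

CREATE TABLE customer_demo (
    name    VARCHAR(100) ENCODE zstd,
    age     INT          ENCODE az64,
    address VARCHAR(200) ENCODE lzo
);

-- Ask Redshift to suggest encodings based on a sample of existing data
ANALYZE COMPRESSION customer_demo;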

Redshift Architecture in Detail

Amazon Redshift is designed as a fully managed, column-oriented, massively parallel processing (MPP) data warehouse with a shared-nothing architecture.

It consists of the following core components:

1. Cluster
A Redshift cluster consists of one Leader Node and one or more Compute Nodes.
Each node type (Leader/Compute) has specific roles:

a. Leader Node:

  • Manages communication with clients.
  • Parses and optimizes SQL queries.
  • Distributes the query plan to Compute Nodes.
  • Aggregates results from Compute Nodes and sends them to the client.

b. Compute Nodes:

  • Execute the query plan in parallel.
  • Store the data in columnar format on local disks.
  • Return intermediate results to the Leader Node.

2. Node Types

a. Dense Compute (DC):
  • SSD storage.
  • High performance for compute-intensive workloads.

b. Dense Storage (DS):
  • HDD storage.
  • Cost-effective for large datasets with lower performance requirements.

(Newer RA3 node types, which separate compute from managed storage, are now the generally recommended choice for new clusters.)
3. Data Distribution
  • Data is distributed across Compute Nodes using Distribution Keys to ensure even distribution and minimize data movement during queries.
  • Each node divides data further into Slices (subunits), with each slice handling a portion of the data.
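To see how slices and distribution look on a running cluster, you can query Redshift's system views; a minimal sketch (stv_slices and svv_table_info are real system views, but the results depend entirely on your cluster and tables):

-- Number of slices on each compute node
SELECT node, COUNT(*) AS slice_count
FROM stv_slices
GROUP BY node
ORDER BY node;

-- Distribution style and row skew for user tables
SELECT "table", diststyle, skew_rows
FROM svv_table_info
ORDER BY skew_rows DESC;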
4. Storage
  • Redshift uses columnar storage, which stores data by columns instead of rows, optimizing for analytics and reducing disk I/O.
  • Each column is stored in blocks, and zone maps track the minimum and maximum values in each block.
5. Query Execution
  • Queries are executed in parallel by distributing tasks across Compute Nodes and their slices.
  • Redshift employs Massively Parallel Processing (MPP) to optimize query performance for large-scale data.
6. Backup and Recovery
  • Redshift automatically backs up data to Amazon S3 and allows for point-in-time recovery.
  • Snapshots can also be taken manually.
7. Redshift Spectrum
  • Allows querying data directly from S3 without moving it into Redshift.
  • Uses the same SQL interface, integrating seamlessly with data stored externally.
8. Security
  • Redshift supports encryption at rest and in transit.
  • It integrates with AWS IAM for access control and provides fine-grained access with column-level security.
Key Features of Redshift

1. Scalable Data Warehousing:
  • Redshift can scale to process petabytes of data.
  • Offers both provisioned clusters and serverless options for flexibility.
2. Columnar Storage:
  • Uses column-oriented databases and data compression to optimize query performance and storage.
3. Massively Parallel Processing (MPP):
  • Distributes data and query load across multiple nodes, enabling high-performance analysis.
4. Integration with BI Tools:
  • Supports integration with tools like Tableau, Looker, and AWS services like QuickSight.
5. SQL-Based Queries:
  • Redshift is compatible with PostgreSQL and allows SQL-based querying.
6. Data Sharing and Analytics:
  • Redshift Spectrum allows querying data directly from Amazon S3 without moving it into Redshift (see the sketch after this list).
  • Features like Materialized Views and Redshift ML enable advanced analytics and machine learning integration.
7. Security and Compliance:
  • Offers features like encryption at rest and in transit, fine-grained access control, and compliance with various standards (e.g., GDPR, HIPAA).
8. Cost Optimization:
  • Pay-as-you-go pricing with reserved instance options for cost savings.
  • Data compression reduces storage costs.                
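As a taste of Redshift Spectrum (feature 6 above), the sketch below registers an external schema backed by the AWS Glue Data Catalog and queries S3-resident data in place; the schema name, Glue database, IAM role ARN, and external table are all placeholders:

CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Query files in S3 as if they were a local table
SELECT product_id, SUM(sales_amount) AS total_sales
FROM spectrum.sales_s3
GROUP BY product_id;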
How Columnar Storage Works in Redshift

Amazon Redshift stores data in a columnar format. This design is a key feature that helps optimize both storage and query performance for large-scale data warehousing and analytics workloads.

1. Columnar Data Layout:
  • Instead of storing data row by row (like traditional row-oriented databases), Redshift stores data column by column. For example, if you have a table with columns A, B, and C, Redshift stores all the values for column A together, then column B, and so on. 
2. Advantages of Columnar Storage:
  • Query Performance: Queries often target only a subset of columns in a table. Redshift reads only the relevant columns instead of scanning the entire table, reducing I/O and improving performance.
  • Compression: Columnar data is highly compressible because values in a column tend to be similar or repetitive, saving storage space.
  • Efficient Aggregation: Operations like SUM, AVG, and COUNT are faster since Redshift processes only the necessary columns.
3. Use of Sort Keys:
  • Redshift allows you to define sort keys to organize data within each column. This enhances the efficiency of range-based queries by enabling Redshift to skip irrelevant blocks of data.
4. Zone Maps:
  • Redshift uses metadata structures called zone maps to keep track of the minimum and maximum values for each block of data in a column. During a query, it can quickly eliminate blocks that don't match the query criteria.

This columnar storage design makes Redshift particularly well-suited for analytical queries, where large datasets are queried to generate insights based on aggregations, filtering, and joins.
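A minimal sketch of how sort keys and zone maps work together (table and column names are illustrative):

-- Sorting by sale_date keeps each block's min/max date range narrow
CREATE TABLE sales_by_day (
    sale_id      BIGINT,
    sale_date    DATE,
    sales_amount DECIMAL(12,2)
)
COMPOUND SORTKEY(sale_date);

-- Zone maps let Redshift skip every block whose date range falls
-- entirely outside January 2024
SELECT SUM(sales_amount)
FROM sales_by_day
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31';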

Query life cycle in Redshift

The query lifecycle in Amazon Redshift involves several key stages, from query submission to the return of results. This process ensures that queries are optimized for performance and executed efficiently across the cluster.


1. Query Submission

A user or application submits a SQL query to the Redshift cluster using a client (e.g., SQL workbench, BI tool, or application). The query is received by the Leader Node, which acts as the orchestrator for query execution.

2. Query Parsing

The Leader Node parses the query to ensure it is syntactically correct. It performs semantic checks to validate table and column references, data types, and other constraints.

3. Query Optimization

The Leader Node optimizes the query to determine the most efficient execution plan:

1. Query Rewrite:
    • Simplifies or transforms the query for better performance (e.g., replacing subqueries with joins).

2. Statistics Utilization:
    • Uses metadata and table statistics (e.g., row counts, data distribution) to plan efficient query execution.

3. Query Plan Generation:
    • Creates an execution plan, which defines how the data will be accessed, filtered, joined, and aggregated.

This includes:

    • Selection of Execution Nodes: Identifies which Compute Nodes will process parts of the query.
    • Use of Sort Keys and Distribution Keys: Determines how data is accessed based on table design.
    • Parallel Processing: Breaks the query into multiple steps for execution across nodes.

4. Query Dispatch

  • The optimized query plan is dispatched to the Compute Nodes by the Leader Node.
  • The plan includes instructions for each step of execution, specifying which Compute Nodes will handle which data.

5. Query Execution

  • The Compute Nodes execute the query in parallel using the columnar data stored locally.

Key operations during execution:

    • Scanning: Reads data blocks from disk, guided by sort keys and zone maps to minimize I/O.
    • Filtering and Projection: Applies filters and extracts only required columns.
    • Joining and Aggregating: Processes joins, aggregations, and other operations as defined in the query.

Intermediate results are sent back to the Leader Node for final aggregation or processing.

6. Query Results Aggregation

  • The Leader Node aggregates or combines intermediate results from the Compute Nodes.
  • If necessary, the Leader Node performs final operations like sorting, grouping, or additional calculations.

7. Query Results Delivery

  • The final result is returned to the client or application that submitted the query.

Performance Optimization Features in the Lifecycle

  • Caching: Redshift caches query results to improve performance for repeated queries.
  • Concurrency Scaling: Automatically adds resources during high query loads to maintain performance.
  • Workload Management (WLM): Redshift allows you to define query queues with priorities and resource limits to ensure critical queries are processed efficiently.

Monitoring and Debugging

  • Use tools like the Amazon Redshift Console, AWS CloudWatch, and EXPLAIN statements to monitor and debug query performance at any stage of the lifecycle (see the sketch below).

This lifecycle ensures that queries are executed efficiently, leveraging Redshift's distributed architecture and columnar storage for high performance.
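A minimal sketch of both approaches (table names reuse the illustrative orders and customers tables defined later in this post):

-- Show the plan the Leader Node generates, without running the query
EXPLAIN
SELECT o.customer_id, SUM(o.total_amount) AS total_spent
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY o.customer_id;

-- Review recently executed queries and their timings
SELECT query, starttime, endtime, TRIM(querytxt) AS sql_text
FROM stl_query
ORDER BY starttime DESC
LIMIT 10;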
Key Types Used in Redshift

Amazon Redshift uses various types of keys to optimize performance and data distribution. These keys help manage how data is stored, queried, and retrieved efficiently across the cluster.

1. Distribution Keys: A distribution key determines how data is distributed across the Compute Nodes in the Redshift cluster.

Types of Distribution Styles:

1. KEY Distribution:
    • Rows are distributed based on the values in a specified column.
    • Rows with the same value are stored on the same node.
    • Ideal for tables frequently joined on the specified column.
2. EVEN Distribution:
    • Rows are distributed evenly across all nodes.
    • Best for tables with no common join columns or when the data size is uniform.
3. ALL Distribution:
    • Entire table is replicated on all nodes.
    • Suitable for small tables that are frequently joined with other tables (e.g., lookup tables).

Example:

CREATE TABLE sales (
sale_id INT,
customer_id INT DISTKEY, -- Distribution Key
amount DECIMAL
)
DISTSTYLE KEY;

Key Benefits:
  • Ensures better query performance by reducing data movement during joins and aggregations.
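For comparison with the KEY example above, minimal sketches of the EVEN and ALL styles (table definitions are illustrative):

-- EVEN: rows are spread round-robin across slices; a safe default
-- when there is no dominant join or aggregation column
CREATE TABLE web_events (
    event_id   BIGINT,
    event_type VARCHAR(50),
    event_ts   TIMESTAMP
)
DISTSTYLE EVEN;

-- ALL: a full copy on every node; suited to small, frequently joined lookup tables
CREATE TABLE country_lookup (
    country_code CHAR(2),
    country_name VARCHAR(100)
)
DISTSTYLE ALL;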

2. Sort Keys: A sort key defines the physical order of data on disk. Sort keys optimize query performance by enabling efficient range-based filtering and reducing I/O.

Types of Sort Keys:

1. COMPOUND Sort Key:
    • Sorts data in the order of the specified columns.
    • Queries benefit if filtering or grouping uses a prefix of these columns.

2. INTERLEAVED Sort Key:
    • Gives equal weight to all specified columns.
    • Better for queries that filter on multiple columns not in sequential order.
Example:

CREATE TABLE orders (
    order_id INT,
    order_date DATE,
    total_amount DECIMAL
)
COMPOUND SORTKEY(order_date, order_id);
------------------------------------------------------------------------
CREATE TABLE orders (
    order_id INT,
    order_date DATE,
    total_amount DECIMAL
)
INTERLEAVED SORTKEY(order_date, order_id);

Key Benefits:
  • Reduces query scan times for range queries.
  • Works well with metadata like zone maps.
3. Primary Keys and Foreign Keys
  • Primarily for documenting data integrity rather than enforcing it. Redshift does not enforce primary key or foreign key constraints; it uses them as informational metadata for query planning.
Primary Key: Ensures uniqueness of rows within a table.
Foreign Key: Establishes a relationship between tables for join optimization.

Example:

CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100)
);

CREATE TABLE orders (
order_id INT,
customer_id INT REFERENCES customers(customer_id) -- Foreign Key
);

Key Benefits:
  • Helps query planner optimize joins and relationships in queries.
4. Compound Keys
  • Allow defining multiple columns as a key in primary keys or foreign keys to represent composite relationships.
  • Use them when a combination of columns uniquely identifies a row.

Example:

CREATE TABLE employee_projects (
employee_id INT,
project_id INT,
PRIMARY KEY (employee_id, project_id)
);

Key Benefits:
  • Useful for multi-dimensional data relationships.
5. Unique Keys
  • Defines constraints to ensure column values are unique within the table.
  • Redshift does not enforce uniqueness but uses this metadata for query optimization.
Example:

CREATE TABLE products (
product_id INT UNIQUE,
product_name VARCHAR(100)
);

6. Distribution and Sort Key Selection Strategy

Choosing the right Distribution and Sort Keys is critical for optimizing performance:

Distribution Key:
    • Should be chosen based on the columns used in joins or aggregations.
    • Avoid skewed data distribution by selecting a key with a wide range of distinct values.
Sort Key:
    • Choose based on the columns used in WHERE, GROUP BY, and ORDER BY clauses.
    • For time-series data, use a COMPOUND sort key with the time column as the leading column (see the sketch below).
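Putting both choices together, a sketch of a time-series fact table (names and types are illustrative):

CREATE TABLE page_views (
    view_id  BIGINT,
    user_id  INT,
    view_ts  TIMESTAMP,
    url      VARCHAR(2048)
)
DISTKEY(user_id)                     -- collocates rows that are joined on user_id
COMPOUND SORTKEY(view_ts, user_id);  -- time column leads, so date-range filters skip blocks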

Surrogate Key: A surrogate key is an artificial or synthetic key used as a unique identifier for each record in a table. Unlike a natural key, which is derived from existing data (like email or SSN), a surrogate key is typically an auto-incremented integer or a UUID that has no business meaning.

Redshift has no SERIAL/auto-increment column type, so surrogate keys are typically generated in one of the following ways:

IDENTITY Column: Generates surrogate key values directly in Redshift.

CREATE TABLE customers (
    customer_id BIGINT IDENTITY(1,1), -- Surrogate Key
    customer_name VARCHAR(100),
    email VARCHAR(100)
);

Syntax:
IDENTITY(seed, step): seed is the starting value, and step is the increment.
Example: IDENTITY(1,1) starts at 1 and increments by 1 for each new row. (Values are guaranteed to be unique, but because rows are loaded in parallel across slices they are not guaranteed to be consecutive.)
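Usage is the same as any other insert; the IDENTITY column is simply omitted and filled in automatically (the values shown are illustrative):

INSERT INTO customers (customer_name, email)
VALUES ('Ravi Kumar', 'ravi@example.com');

-- customer_id was generated automatically
SELECT customer_id, customer_name, email FROM customers;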

Using ROW_NUMBER():
A surrogate key can be generated manually using the ROW_NUMBER() function in a SELECT query.

CREATE TABLE customers_with_key AS
SELECT ROW_NUMBER() OVER () AS customer_id,
       customer_name,
       email
FROM customers;

ETL Tools:
Surrogate keys can be added during ETL processes using tools like AWS Glue, Apache Airflow, or third-party platforms (e.g., Informatica, Talend).

UUIDs:
If you need globally unique surrogate keys, you can use a UUID generator. This is less common in Redshift due to storage inefficiency compared to integers.

Benefits of Surrogate Keys:
  • Decouples Data from Business Logic: Changes in business logic (e.g., a customer changing their email) won’t affect surrogate keys, avoiding cascading updates.
  • Improved Query Performance: Integer-based surrogate keys are faster for joins compared to multi-column natural keys or longer strings.
  • Consistency Across Tables: Useful when integrating data from disparate systems.
When to Use Surrogate Keys in Redshift
  • Fact and Dimension Tables: In star schema models, surrogate keys are used in dimension tables as primary keys and in fact tables as foreign keys.
  • Data Consolidation: When merging data from multiple sources with potentially conflicting or overlapping natural keys.
  • Handling Nulls and Missing Values: Natural keys might have nulls or duplicates; surrogate keys avoid these issues.
Optimization Strategies in Redshift

  • Data Distribution: Choose Distribution Keys based on frequently joined columns to minimize data movement.
  • Sorting: Use Sort Keys for range-based queries (e.g., time-series data).
  • Compression: Redshift automatically applies compression encodings; verify and modify them if necessary.
  • Query Rewrite: Refactor complex queries to leverage Redshift’s parallel processing and columnar storage.
  • Vacuum and Analyze: Regularly run VACUUM and ANALYZE commands to optimize storage and query plans (see the example below).
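A minimal sketch of these maintenance commands, run against the sales table used in earlier examples:

VACUUM FULL sales;          -- reclaims space from deleted rows and re-sorts the table
ANALYZE sales;              -- refreshes table statistics used by the query planner
ANALYZE COMPRESSION sales;  -- reports recommended column encodings (does not change the table)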

View and Materialized view in Redshift

1. Views in Amazon Redshift

A view in Redshift is a virtual table created by querying other tables. It does not store data physically but stores the SQL query that generates the result set whenever the view is queried.

Key Features of Views

  • Dynamic Execution: Every time a view is queried, Redshift executes the underlying query dynamically to retrieve the current data.
  • Simplified Querying: Allows complex queries to be reused without re-writing them.
  • Security and Access Control: Permissions can be granted on views to restrict direct access to underlying tables.

Syntax to Create a View and Example

CREATE VIEW top_selling_products AS
SELECT product_id, SUM(sales_amount) AS total_sales
FROM sales
GROUP BY product_id
HAVING SUM(sales_amount) > 10000;
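Because permissions can be granted on the view without exposing the underlying table, access control might look like this (the group name is hypothetical):

GRANT SELECT ON top_selling_products TO GROUP reporting_users;
-- Members of reporting_users can query the view without needing SELECT on the sales table.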

Limitations of Views in Redshift

  • Performance: As views are not materialized, they re-execute the underlying query each time they are called.
  • Nested Views: Redshift has limited support for complex nested views.
  • Dependency Management: Dropping underlying objects can break the view.

2. Materialized Views in Amazon Redshift

A materialized view stores the results of a query physically on disk, unlike a regular view. It provides faster query performance since the data is pre-computed and stored.

Key Features of Materialized Views

  • Pre-Computed Results: Stores the query result, reducing execution time for complex or frequently used queries.
  • Refreshable: Can be refreshed manually or automatically to keep the data up-to-date.
  • Performance Optimization: Useful for aggregations and joins where recalculating data is expensive.

Syntax to Create a Materialized View and Example

CREATE MATERIALIZED VIEW monthly_sales AS
SELECT product_id,
       DATE_TRUNC('month', sale_date) AS sale_month,
       SUM(sales_amount) AS total_sales
FROM sales
GROUP BY product_id, sale_month;

Refreshing a Materialized View

Manual Refresh:

    REFRESH MATERIALIZED VIEW mv_name;

Auto Refresh:

  • Redshift can automatically refresh eligible materialized views that are created with the AUTO REFRESH YES option; views whose queries use certain SQL constructs are not eligible and must be refreshed manually.
  • Standard views never need refreshing, because their query is re-executed every time they are read.
  • For full control over refresh timing (for example, as part of scheduled ETL workflows), you can trigger REFRESH MATERIALIZED VIEW from scheduled jobs using tools like AWS Lambda, Amazon EventBridge, scheduled queries in Redshift Query Editor v2, or dbt (Data Build Tool).
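A minimal sketch of an auto-refreshing materialized view, assuming its query is eligible for auto refresh:

CREATE MATERIALIZED VIEW monthly_sales_auto
AUTO REFRESH YES
AS
SELECT product_id,
       DATE_TRUNC('month', sale_date) AS sale_month,
       SUM(sales_amount) AS total_sales
FROM sales
GROUP BY product_id, sale_month;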

Limitations of Materialized Views

  • Storage: Materialized views consume disk space as they store query results.
  • Staleness: Data in materialized views is not real-time and needs to be refreshed to reflect changes in the underlying tables.
  • Unsupported Features: Certain SQL constructs (for example, some set operations and window functions) are either not allowed in a materialized view definition or make it ineligible for incremental and automatic refresh.



When to Use Each

Use Views:
  • For queries requiring real-time, up-to-date data.
  • When working with lightweight queries or those that don’t need frequent optimization.
Use Materialized Views:
  • For complex aggregations, joins, or calculations that don’t change frequently.
  • When optimizing for performance in analytic dashboards or reporting.

Schema Binding in Amazon Redshift

In Amazon Redshift, schema binding refers to a view or materialized view being bound to the underlying objects it references, so that no changes can be made to those objects without first updating or removing the dependent view. Regular Redshift views are schema-bound by default; the opposite behavior is provided by late-binding views, created WITH NO SCHEMA BINDING.

Key Concept of Schema Binding:
  • Schema binding restricts certain operations on the underlying tables and views that are referenced by the view.
  • When you create a view or materialized view with schema binding enabled, the database ensures that no changes can be made to the underlying objects (such as columns or tables) without invalidating or dropping the dependent view first.
Use of Schema Binding
  • Schema binding ensures that the view stays consistent and prevents accidental changes that could break or cause errors in the dependent view. 
  • This is particularly useful in ensuring that the structure of the underlying tables or views doesn’t change unexpectedly and disrupt the dependent view or materialized view.
Schema Binding for Views:

In Redshift, regular views are schema-bound by default; no special clause is required. (The WITH SCHEMABINDING clause is SQL Server syntax and is not valid in Redshift.) To create a view that is not bound to its underlying objects, use WITH NO SCHEMA BINDING, which creates a late-binding view.

Syntax:
CREATE VIEW view_name
AS SELECT column1, column2 FROM table_name WHERE condition;

Because the view is schema-bound, the referenced tables and columns cannot be dropped, and structural changes the view depends on are blocked, unless the view is dropped first (or DROP ... CASCADE is used, which also drops the view).

Example:
CREATE VIEW total_sales_view
AS SELECT product_id, SUM(sales_amount) AS total_sales
FROM sales GROUP BY product_id;

In this example, if a user tries to drop the sales table or the columns it references, Redshift will block the change unless total_sales_view is dropped first (or the change is made with CASCADE, which drops the view as well).
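If you do not want this protection (for example, when the underlying table is routinely dropped and recreated by an ETL job), a late-binding view gives the opposite behavior; note that late-binding views require schema-qualified table names. A minimal sketch:

CREATE VIEW total_sales_view_lb AS
SELECT product_id, SUM(sales_amount) AS total_sales
FROM public.sales
GROUP BY product_id
WITH NO SCHEMA BINDING;

-- public.sales can now be dropped or recreated freely; the view is only
-- validated the next time it is queried.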


Advantages of Schema Binding
  • Data Integrity: Ensures that views always return valid and consistent data because the structure of the underlying tables cannot change without considering the impact on the views.
  • Protects Views from Accidental Changes: Avoids errors where a table’s structure changes (like removing or renaming a column) which could break the dependent views.
  • Prevents DDL Changes Without Proper Handling: If someone tries to drop or modify the structure of a table referenced by a view, Redshift will block it until the view is removed or altered, ensuring that the integrity of the query logic is maintained.

Limitations of Schema Binding
  • No Altering of Tables: While a schema-bound view exists, you cannot change the structure of the underlying tables in ways the view depends on (e.g., dropping or changing the data type of referenced columns) unless the view is first dropped or modified.
  • No Dropping of Columns or Tables: You cannot drop a table or column that is used in a schema-bound view until the view is removed (or you use DROP ... CASCADE, which removes the view as well).
When to Use Schema Binding
  • Data Security: When you want to make sure that the data model doesn’t change without explicit review, ensuring data integrity for reports or downstream analytics.
  • Complex Data Models: In scenarios where complex queries are used in multiple reports or dashboards, and you want to ensure that changes to underlying data structures don’t break the logic.
  • ETL Pipelines: When building ETL pipelines, schema binding ensures that no changes occur to the source tables without taking into account the downstream views.

SCENARIO

Example Use Case: Schema Binding in Redshift

You have a sales data warehouse, and you are creating a view to aggregate total sales per product. If the view were late-binding (created WITH NO SCHEMA BINDING), someone altering the sales table (e.g., dropping a column or changing a column type) could silently break total_sales_view. With a regular, schema-bound view, Redshift prevents such modifications to the sales table until the view is dropped or the change is made deliberately with CASCADE.

-- Create a regular (schema-bound) view

CREATE VIEW total_sales_view
AS SELECT product_id, SUM(sales_amount) AS total_sales
FROM sales GROUP BY product_id;

-- Protects against accidental changes to the sales table:
-- dropping the table or the columns the view references will fail
-- unless the view is dropped first or DROP ... CASCADE is used.



