Common Key Terms & Terminologies
Scroll down or use Ctrl+F if you don't find a term at the top.
Data warehouse
A Data Warehouse (DWH) is a centralized repository designed for storing, managing, and analyzing large volumes of structured data from multiple sources. It enables businesses to perform complex queries, generate reports, and gain insights for decision-making.
Key Characteristics:
- Subject-Oriented: Organized around key business areas (e.g., sales, finance).
- Integrated: Combines data from different sources into a unified format.
- Time-Variant: Stores historical data for trend analysis.
- Non-Volatile: Data is read-only and does not change once stored.
Common Technologies:
- On-Premise: SQL Server, Oracle, Teradata
- Cloud-Based: Amazon Redshift, Google BigQuery, Snowflake, Azure Synapse Analytics
A data warehouse supports Business Intelligence (BI) and analytics by providing structured, cleaned, and optimized data for reporting and decision-making.
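Example (illustrative): a typical warehouse query aggregates historical data from the integrated sources. The sales table and its columns below are hypothetical:
-- Monthly revenue by region over the stored history.
SELECT DATE_TRUNC('month', sale_date) AS sale_month,
       region,
       SUM(amount) AS total_revenue
FROM sales
GROUP BY 1, 2
ORDER BY 1, 2;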
----------------------------------------------------------------------------------------------------------------------------
Massively Parallel Processing (MPP)
Massively Parallel Processing (MPP) is a computing architecture that enables the simultaneous execution of multiple processes across multiple processors or nodes. It is commonly used in large-scale data processing and analytics, especially in data warehouses and big data environments.
Key Features of MPP:
- Parallel Execution: Workloads are divided and processed simultaneously across multiple nodes.
- Shared-Nothing Architecture: Each node has its own CPU, memory, and storage, reducing bottlenecks.
- Scalability: Easily scales horizontally by adding more nodes.
- High Performance: Optimized for handling large datasets and complex queries.
Common MPP Systems:
- Cloud-Based: Amazon Redshift, Google BigQuery, Snowflake, Azure Synapse Analytics
- On-Premise: Teradata, Greenplum, IBM Netezza
MPP is ideal for data warehousing, analytics, and ETL workloads that require fast query performance on large datasets.
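As an illustration (the web_events table is hypothetical), the same SQL runs unchanged on an MPP engine: each compute node scans its own slice of the data and produces a partial aggregate, and the leader combines the partial results.
-- Each node counts the events stored on its own slice in parallel;
-- the per-node counts are then merged into the final result.
SELECT event_date, COUNT(*) AS event_count
FROM web_events
GROUP BY event_date;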
---------------------------------------------------------------------------------------------------------------------------
I/O (Input/Output)
I/O (Input/Output) refers to the communication between a computer system and external devices or components. It involves the transfer of data between the system's processor/memory and peripheral devices like disks, keyboards, networks, or external storage.
Types of I/O:
- Input: Data received by the system (e.g., keyboard, mouse, sensors, network data).
- Output: Data sent from the system to external devices (e.g., display, printer, network transmission).
I/O in Computing:
- Disk I/O: Reading/writing data to hard drives or SSDs.
- Network I/O: Data exchange over the internet or internal networks.
- Memory I/O: Data transfer between RAM and CPU.
I/O Performance Considerations:
- Throughput: Amount of data transferred per second.
- Latency: Time taken to complete an I/O operation.
- Buffering & Caching: Used to optimize I/O performance.
Efficient I/O management is critical for system performance, especially in big data processing, databases, and high-performance computing (HPC) environments.
-----------------------------------------------------------------------------------------------------------------------------
Column-Oriented (Columnar) Storage
Column-oriented storage is a database storage technique where data is stored by columns instead of rows. This approach is highly optimized for analytical queries and data warehousing, where operations like aggregation and filtering on specific columns are common.
Key Characteristics:
- Column-wise Storage: Data is stored column by column rather than row by row.
- Efficient Compression: Similar data in a column compresses better than in row-based storage.
- Faster Query Performance: Ideal for read-heavy analytical queries (e.g., SUM, AVG, COUNT); see the example below.
- Optimized for OLAP (Online Analytical Processing): Not suitable for transactional workloads (OLTP).
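For example (the wide orders table below is hypothetical), this query touches only two columns, so a columnar engine reads just those columns from disk rather than every full row:
-- Only order_date and amount are read, no matter how many other columns orders has.
SELECT order_date, SUM(amount) AS daily_total
FROM orders
GROUP BY order_date;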
Comparison with Row-Oriented Storage:
Feature | Row-Oriented (Traditional DB) | Column-Oriented (Analytical DB) |
---|---|---|
Storage | Stores complete rows together | Stores values of a single column together |
Read Performance | Slower for analytical queries | Faster for analytical queries |
Write Performance | Faster for transactional inserts | Slower for frequent writes |
Compression | Less effective | Highly efficient |
Examples of Column-Oriented Databases:
- Google BigQuery
- Amazon Redshift
- Snowflake
- Apache Parquet (file format)
- Apache ORC (file format)
Column-oriented databases are widely used in data warehouses, big data analytics, and business intelligence applications.
-----------------------------------------------------------------------------------------------------------------------------
Amazon Redshift Architecture
Amazon Redshift is a fully managed, cloud-based data warehouse that uses a Massively Parallel Processing (MPP) architecture for high-performance querying and analytics on large datasets.
Key Components of Redshift Architecture:
Cluster
A Redshift cluster is the core unit containing multiple nodes that store and process data. It consists of:
- Leader Node (Manages queries & coordination)
- Compute Nodes (Store data & execute queries)
Leader Node
- Receives queries from users.
- Parses, optimizes, and distributes queries to compute nodes.
- Aggregates results and sends them back to users.
Compute Nodes
- Perform actual query execution.
- Store data in a columnar format for efficient retrieval.
- Each node is divided into slices, where each slice processes a portion of the data.
Node Types
- Dense Compute (DC2) → SSD-based nodes for compute-intensive, high-performance workloads.
- Dense Storage (DS2) → HDD-based nodes for very large datasets.
- RA3 → Nodes with Redshift Managed Storage (local SSD cache backed by S3), scaling compute and storage independently.
Storage Architecture
- Uses columnar storage for high compression and fast queries.
- Distributes data across nodes using distribution styles (Even, Key, or All).
- Uses Zone Maps & Sort Keys to optimize data retrieval (see the example table definition below).
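A minimal sketch of how a distribution key and sort key are declared (the sales table and its columns are hypothetical):
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)  -- rows with the same customer_id land on the same slice
SORTKEY (sale_date);   -- zone maps can skip blocks outside a date filter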
Query Execution
- Queries are broken into parallel tasks and executed across compute nodes.
- Uses AWS S3 for backups and Redshift Spectrum for querying external data.
Why Redshift?
- Fast Query Performance (columnar storage + MPP)
- Scalability (easily add/remove nodes)
- Cost-Effective (pay-as-you-go pricing)
- Integration with AWS Services (S3, Glue, Athena, Lambda)
Redshift is ideal for data warehousing, business intelligence (BI), and large-scale analytics in enterprises. 🚀
-----------------------------------------------------------------------------------------------------------------------------
Cluster (Computing & Databases)
A cluster is a group of interconnected computers or servers that work together as a single system to improve performance, scalability, and fault tolerance. Clusters are commonly used in high-performance computing (HPC), databases, and cloud environments.
Types of Clusters
High-Availability (HA) Cluster
- Ensures continuous availability of applications.
- If one node fails, another takes over (failover mechanism).
- Used in mission-critical applications like banking systems.
Load Balancing Cluster
- Distributes workloads across multiple servers.
- Ensures efficient resource utilization and high availability.
- Common in web applications and cloud services.
High-Performance Computing (HPC) Cluster
- Used for scientific computing, simulations, and AI.
- Uses parallel processing to solve complex problems.
- Examples: Supercomputers, Machine Learning clusters.
Database Cluster
- A group of database servers that share data and workload.
- Provides fault tolerance and horizontal scaling.
- Examples: Amazon Redshift, MySQL Cluster, PostgreSQL Cluster.
Cluster in Data Warehousing (e.g., Redshift, BigQuery)
- Contains multiple nodes for parallel processing.
- Leader Node manages query distribution.
- Compute Nodes store and process data in MPP (Massively Parallel Processing) mode.
Benefits of Clusters
- Scalability – Easily add more nodes as data grows.
- Fault Tolerance – Redundant nodes prevent downtime.
- High Performance – Distributes workloads efficiently.
- Cost Efficiency – Optimizes resource usage.
Clusters are essential in big data analytics, cloud computing, and AI/ML applications to ensure efficient processing and high availability. 🚀
-----------------------------------------------------------------------------------------------------------------------------
OLAP vs. OLTP
OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing) are two distinct database processing systems designed for different purposes.
OLAP (Online Analytical Processing)
- Purpose: Used for data analysis, reporting, and business intelligence (BI).
- Data Type: Historical, aggregated, multi-dimensional data.
- Operations: Complex queries, data mining, and analytical calculations (e.g., SUM, AVG, ROLLUP).
- Speed: Optimized for read-heavy workloads (fast query performance).
- Storage: Uses columnar storage for efficient querying.
Examples:
- Data warehouses (Amazon Redshift, Snowflake, Google BigQuery)
- BI tools (Power BI, Tableau)
📌 Use Case: Analyzing sales trends over the past 5 years.
OLTP (Online Transaction Processing)
- Purpose: Used for real-time transaction processing in operational systems.
- Data Type: Current, highly detailed transactional data.
- Operations: Frequent INSERT, UPDATE, DELETE operations.
- Speed: Optimized for write-heavy workloads (high-speed transactions).
- Storage: Uses row-based storage for fast lookups and updates.
Examples:
- Banking systems
- E-commerce order processing
- Reservation systems (airline, hotel)
📌 Use Case: Processing a customer's online order in real time (contrast the two query patterns below).
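To make the contrast concrete, here is a hedged sketch against a hypothetical orders table: an OLAP-style aggregation over history versus an OLTP-style single-row write.
-- OLAP: scan and aggregate years of history (read-heavy, columnar-friendly).
SELECT DATE_TRUNC('year', order_date) AS order_year,
       SUM(amount) AS revenue
FROM orders
GROUP BY 1;
-- OLTP: record one new order in real time (write-heavy, row-oriented).
INSERT INTO orders (order_id, customer_id, order_date, amount)
VALUES (1001, 42, CURRENT_DATE, 59.99);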
Key Differences:
Feature | OLAP (Analytical) | OLTP (Transactional) |
---|---|---|
Purpose | Analysis & Reporting | Real-Time Transactions |
Data Type | Historical, Aggregated | Current, Detailed |
Queries | Complex (JOINs, Aggregations) | Simple (INSERT, UPDATE, DELETE) |
Storage | Columnar | Row-Based |
Speed | Optimized for Reads | Optimized for Writes |
Example | Amazon Redshift, Snowflake | MySQL, PostgreSQL, SQL Server |
Summary
- OLAP is for analysis, focusing on fast querying.
- OLTP is for transactions, focusing on fast updates and inserts.
- Both can coexist, with OLTP systems feeding data into OLAP systems for analysis.
Hybrid Example: Modern cloud platforms such as Google BigQuery, Snowflake, and Azure Synapse are primarily OLAP engines, but with streaming ingestion from OLTP sources they can support near-real-time analytics.
-----------------------------------------------------------------------------------------------------------------------------
Node (Computing & Databases)
A node is a fundamental unit in a distributed computing system, network, or database cluster. It represents an individual computer, server, or processing unit that performs specific tasks in coordination with other nodes.
Types of Nodes in Different Systems:
Computing & Distributed Systems
- A node can be any device (computer, server, or virtual machine) that participates in a network.
- Example: In a Hadoop cluster, each node processes a part of the dataset.
Database & Data Warehousing Nodes
- Leader Node → Manages and coordinates queries.
- Compute Nodes → Store data and execute queries in parallel.
- Example: Amazon Redshift, Google BigQuery, Snowflake use nodes for Massively Parallel Processing (MPP).
Blockchain Nodes
- Maintain a copy of the blockchain ledger.
- Validate and propagate transactions.
- Example: Bitcoin, Ethereum nodes.
Network Nodes
- Any device (router, switch, or computer) that sends or receives data.
- Example: DNS servers, IoT devices.
Key Characteristics of Nodes
- Scalability → More nodes improve performance & storage.
- Fault Tolerance → If a node fails, others continue working.
- Parallel Processing → Tasks are distributed across multiple nodes.
Example: Amazon Redshift Node Types
- Dense Compute (DC2) → High-speed, SSD-based.
- Dense Storage (RA3, DS2) → Large data capacity, HDD-based.
Nodes are the building blocks of distributed computing, cloud computing, and data engineering!
-----------------------------------------------------------------------------------------------------------------------------
Redshift Spectrum
Redshift Spectrum is a feature of Amazon Redshift that allows you to query data stored in Amazon S3 directly without loading it into Redshift tables. It enables fast, cost-effective analytics on exabyte-scale datasets.
How Redshift Spectrum Works
- Data Storage in S3 → Your raw data is stored in Amazon S3 in formats like CSV, Parquet, ORC, Avro, or JSON.
- External Schema in AWS Glue or Redshift → Define an external schema using the AWS Glue Data Catalog or Redshift's external schema support.
- Query from Redshift → Run SQL queries using Amazon Redshift SQL (SELECT, JOIN, GROUP BY, etc.).
- Spectrum Query Engine → Redshift Spectrum processes queries against S3 without moving the data into Redshift.
- Results Combined → If needed, Redshift can join S3 data with existing Redshift tables.
Key Features of Redshift Spectrum
- No Data Loading Required → Query data directly from S3.
- Supports Multiple File Formats → Works with Parquet, ORC, JSON, Avro, CSV, TSV, and more.
- Massively Parallel Processing (MPP) → Uses multiple nodes to process queries quickly.
- Integrates with AWS Glue → Uses the AWS Glue Data Catalog to define table structures.
- Cost-Effective → Pay only for the data scanned (per-TB pricing).
Redshift Spectrum vs. Traditional Redshift
Feature | Redshift Spectrum | Traditional Redshift |
---|---|---|
Data Location | Amazon S3 (External) | Redshift Cluster (Internal) |
Performance | Good for large, rarely accessed data | Faster for frequently accessed data |
Cost | Pay per TB scanned | Pay for Redshift storage |
Use Case | Querying external big data | High-speed data warehousing |
When to Use Redshift Spectrum?
- Big Data Analytics → Query petabytes of data stored in S3.
- Data Lake Queries → Analyze unstructured/semi-structured data.
- Log Analysis → Process log files stored in S3.
- Cost Optimization → Store cold data in S3 and only query it when needed.
Example Query Using Redshift Spectrum
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'my_s3_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
SELECT * FROM spectrum_schema.sales_data WHERE year = 2024;
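🔹 If needed, the external table can also be joined with a regular Redshift table in the same query; a hedged sketch (dim_region and its columns are hypothetical):
SELECT s.year, d.region, SUM(s.amount) AS total_sales
FROM spectrum_schema.sales_data s               -- external table in S3
JOIN dim_region d ON s.region_id = d.region_id  -- internal Redshift table
WHERE s.year = 2024
GROUP BY s.year, d.region;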
Summary
🔹 Redshift Spectrum lets you query data in Amazon S3 without loading it into Redshift.
🔹 It reduces storage costs and is great for analyzing large datasets.
🔹 Best used for infrequently accessed, massive datasets.
💡 Hybrid Approach → Keep frequently accessed data in Redshift and archive older data in S3 while still querying it with Redshift Spectrum! 🚀
-----------------------------------------------------------------------------------------------------------------------------
Amazon S3 (Simple Storage Service)
Amazon S3 (Simple Storage Service) is an object storage service provided by AWS that allows you to store and retrieve any amount of data from anywhere on the internet. It is highly scalable, secure, and cost-effective, making it a popular choice for cloud storage.
Key Features of Amazon S3
- Object Storage → Stores data as objects (files) inside buckets (containers).
- Scalability → Handles petabytes of data and billions of objects.
- Durability → 99.999999999% (11 nines) durability, with data replicated across multiple locations.
- Security & Access Control → Uses IAM roles, bucket policies, and encryption to secure data.
- Cost-Effective → Pay only for what you use, with different storage classes.
- Serverless → No need to manage servers; fully managed by AWS.
S3 Storage Classes (Cost Optimization)
Storage Class | Use Case | Cost & Performance |
---|---|---|
S3 Standard | Frequently accessed data | High performance, low latency |
S3 Intelligent-Tiering | Unpredictable access patterns | Automatically moves data between tiers |
S3 Standard-IA (Infrequent Access) | Infrequent access, but fast retrieval | Lower cost than Standard |
S3 One Zone-IA | Infrequent access, stored in one zone | Cheaper but less resilient |
S3 Glacier | Archiving data | Retrieval within minutes to hours |
S3 Glacier Deep Archive | Long-term archival | Cheapest, retrieval in hours |
How S3 Works
- Create a Bucket → A bucket is a storage container for objects (files).
- Upload Objects → Store files (text, images, videos, backups, logs, etc.).
- Set Permissions → Control access using IAM policies, ACLs, or bucket policies.
- Retrieve Data → Use the REST API, SDKs, AWS CLI, or AWS Console.
Example: Upload & Retrieve File Using AWS CLI
🔹 Upload a File to S3:
aws s3 cp myfile.txt s3://my-bucket-name/
🔹 Download a File from S3:
aws s3 cp s3://my-bucket-name/myfile.txt
🔹 List Objects in a Bucket:
aws s3 ls s3://my-bucket-name/
Common Use Cases of S3
- Big Data & Analytics → Stores raw and processed data (used with Redshift Spectrum, Athena, Databricks).
- Backup & Disaster Recovery → Secure and durable storage for backups.
- Data Lake → Centralized storage for structured and unstructured data.
- Hosting Static Websites → S3 can serve HTML, CSS, and JS files directly.
- Machine Learning → Stores training datasets for AI/ML models.
S3 vs Traditional Storage
Feature | Amazon S3 | Traditional Storage |
---|---|---|
Scalability | Unlimited storage | Limited storage |
Access | Global, via API/SDK | Local network access |
Durability | 99.999999999% (11 nines) | Risk of data loss |
Management | Fully managed | Requires maintenance |
Pricing | Pay-as-you-go | Fixed infrastructure costs |
Summary
🔹 Amazon S3 is a highly durable, scalable, and cost-efficient cloud storage solution.
🔹 It supports multiple storage classes to optimize costs based on data access frequency.
🔹 S3 is commonly used for data lakes, analytics, backups, and web hosting.
💡 Best Practice → Combine S3 + AWS Lambda + AWS Glue for a serverless data pipeline!
-----------------------------------------------------------------------------------------------------------------------------
AWS IAM (Identity and Access Management)
AWS IAM (Identity and Access Management) is a security service that enables you to manage who can access AWS resources and what actions they can perform. It provides fine-grained permissions, allowing organizations to securely control user access to AWS services.
Key Features of AWS IAM
- User Management → Create and manage users, groups, and roles.
- Granular Permissions → Define permissions using IAM policies.
- Multi-Factor Authentication (MFA) → Add an extra security layer for user logins.
- Federation & SSO → Integrate with Active Directory (AD), Google, or Okta.
- Temporary Credentials → Use IAM roles to grant short-term access.
- Least Privilege Access → Follow security best practices by granting only the permissions that are needed.
IAM Core Components
Component | Description | Example |
---|---|---|
IAM User | Individual account with AWS access | john_doe with s3:ListBucket access |
IAM Group | Collection of users with shared permissions | Developers group with EC2 access |
IAM Role | Temporary permissions assigned to users or AWS services | Lambda function assumes S3ReadOnly role |
IAM Policy | JSON-based document defining access rules | Allows GetObject for s3://my-bucket/* |
IAM Access Keys | Credentials for programmatic access | Used in AWS CLI & SDKs |
MFA (Multi-Factor Authentication) | Adds an extra layer of security | Required for console login |
Example: IAM Policy (JSON Format)
The following policy allows a user to read objects from an S3 bucket:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::my-bucket/*"
}
]
}
IAM Roles vs IAM Users
Feature | IAM Role | IAM User |
---|---|---|
Best Use Case | Used by AWS services (EC2, Lambda, etc.) | Used by humans or applications |
Authentication | Temporary credentials (STS) | Long-term credentials (username/password) |
Security | More secure (no static credentials) | Requires careful key management |
Example | EC2 assumes an IAM role to access S3 | A developer logs into the AWS console |
IAM Best Practices
🔹 Follow Least Privilege → Grant only necessary permissions.
🔹 Use IAM Roles Instead of Users → Avoid long-term credentials.
🔹 Enable MFA → Protect IAM users with Multi-Factor Authentication.
🔹 Rotate Access Keys Regularly → Reduce risk of compromised credentials.
🔹 Monitor IAM Activity → Use AWS CloudTrail to track actions.
Summary
🔹 AWS IAM is a security service that controls access to AWS resources.
🔹 Uses policies, roles, users, and groups to manage permissions.
🔹 Best practice: Use IAM roles for AWS services instead of IAM users.
💡 Secure your AWS environment by implementing IAM best practices!
-----------------------------------------------------------------------------------------------------------------------------
Column-Level Security (CLS)
Column-Level Security (CLS) is a data security mechanism that restricts access to specific columns within a database table, ensuring that users can only view or modify the columns they are authorized to access. It helps organizations protect sensitive data such as personally identifiable information (PII) or financial details.
🔷 Why Use Column-Level Security?
- Protect Sensitive Data → Restrict access to PII, financial, and health records.
- Compliance & Regulations → Enforce GDPR, HIPAA, and PCI DSS security rules.
- Fine-Grained Access Control → Users see only the columns they are allowed to view.
- Better Performance → Avoid creating multiple views or redundant tables for security.
Methods to Implement Column-Level Security
Method | Description | Use Case |
---|---|---|
Database Views | Create a view that exposes only allowed columns | Limit access to sensitive columns |
GRANT & REVOKE (SQL Permissions) | Use SQL commands to restrict column access | Control read/write permissions for columns |
Dynamic Data Masking | Automatically hides data based on user roles | Mask PII data (e.g., XXXX-XX-1234 ) |
Row-Level Security (RLS) + CLS | Combine CLS with RLS for enhanced security | Restrict data per user and per column |
Example: Column-Level Security in SQL Server
🔹 Grant SELECT on specific columns only:
GRANT SELECT (name, email) ON Employees TO EmployeeRole;
🔹 Revoke access to sensitive columns:
REVOKE SELECT (salary, ssn) ON Employees FROM EmployeeRole;
🔹 Using a view to restrict access:
CREATE VIEW Employee_Public AS
SELECT name, department FROM Employees;
🔹 Dynamic Data Masking (DDM) for sensitive data:
ALTER TABLE Employees
ALTER COLUMN ssn ADD MASKED WITH (FUNCTION = 'partial(0,"XXX-XX-",4)');
Column-Level Security in Cloud Platforms
Platform | CLS Feature | Example |
---|---|---|
AWS Redshift | Column-level privileges | GRANT SELECT (name) ON Employees TO analyst; |
Azure Synapse | Column masking + security roles | Dynamic Data Masking for SSNs |
Google BigQuery | Column-level access policies | Restrict access using IAM roles |
Summary
🔹 Column-Level Security (CLS) ensures that users can only access authorized columns.
🔹 It protects sensitive data, enforces compliance, and improves security.
🔹 Implemented via SQL permissions, views, masking, or IAM policies in cloud platforms.
💡 Best Practice: Combine CLS with Row-Level Security (RLS) for maximum data protection!
-----------------------------------------------------------------------------------------------------------------------------
Caching
Caching is the process of storing frequently accessed data in a temporary storage layer to improve performance and reduce the need for repeated expensive computations or database queries. It helps speed up data retrieval, reduce latency, and optimize resource usage.
Why Use Caching?
- Improves Performance → Faster data retrieval without querying the database.
- Reduces Load on Backend → Fewer database hits, saving computing resources.
- Enhances Scalability → Handles high traffic efficiently with quick responses.
- Lowers Costs → Minimizes database query costs in cloud environments (see the materialized-view sketch below).
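In a warehouse context, one common caching pattern is a materialized view that stores a precomputed result so repeated dashboard queries do not re-scan the base table. A minimal Redshift-style sketch (the sales table and view name are hypothetical):
-- Precompute (cache) daily totals once...
CREATE MATERIALIZED VIEW daily_sales AS
SELECT sale_date, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date;
-- ...dashboards then read the cached result instead of scanning sales.
SELECT * FROM daily_sales WHERE sale_date >= '2024-01-01';
-- Refresh periodically to pick up new rows.
REFRESH MATERIALIZED VIEW daily_sales;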
-----------------------------------------------------------------------------------------------------------------------------
Shared-Nothing Architecture
Shared-Nothing Architecture (SNA) is a distributed computing design where each node in the system operates independently, with no shared memory or storage between them. Each node has its own CPU, RAM, and disk, and communication between nodes happens via a network.
Key Features
- No Resource Contention → Each node is self-sufficient, preventing bottlenecks.
- High Scalability → Nodes can be added without affecting existing ones.
- Fault Tolerance → A failed node does not impact others.
- Parallel Processing → Each node works independently, enabling MPP (Massively Parallel Processing).
How Shared-Nothing Works
- Data is partitioned across nodes (e.g., hashing, range-based partitioning).
- Each node processes its own portion of the data.
- Nodes communicate via a network when necessary.
🔹 Example: Distributed Databases (Sharding)
In a Shared-Nothing database system, data is sharded across multiple database servers. Each server manages its own partition without sharing CPU, memory, or disk with others.
Shared-Nothing vs. Other Architectures
Architecture | Description | Example |
---|---|---|
Shared-Nothing | Each node has its own resources, no shared memory | Amazon Redshift, Google BigQuery |
Shared-Disk | Nodes share storage but have separate CPUs and memory | Oracle RAC, IBM Db2 |
Shared-Memory | Multiple CPUs share memory and storage | Traditional multi-core servers |
Examples of Shared-Nothing Systems
System | Type | Use Case |
---|---|---|
Amazon Redshift | Data warehouse | Parallel query execution |
Google BigQuery | Cloud analytics | Distributed data processing |
Apache Hadoop HDFS | Big Data storage | Distributed file system |
NoSQL Databases (MongoDB, Cassandra) | Distributed databases | High availability & scalability |
Advantages of Shared-Nothing Architecture
✔ Better Scalability → Easily add more nodes without performance degradation.
✔ Higher Fault Tolerance → Node failures don’t impact the system as a whole.
✔ Efficient Parallelism → Workloads are distributed across nodes.
🔹 Best for: Big Data Analytics, Cloud Data Warehousing, Distributed Databases
-----------------------------------------------------------------------------------------------------------------------------
VACUUM, ANALYZE, and EXPLAIN in Amazon Redshift
Amazon Redshift provides VACUUM, ANALYZE, and EXPLAIN commands to optimize query performance and manage data efficiently.
VACUUM (Table Optimization)
🔹 Purpose: Reorganizes tables to improve query performance by reclaiming disk space and sorting data.
🔹 Why?: In Redshift, deleted or updated rows are marked as "ghost rows" (not removed immediately). VACUUM helps physically remove them and re-sort the table.
Syntax
VACUUM [ FULL | SORT ONLY | DELETE ONLY | REINDEX ] table_name;
Types of VACUUM
Type | Description |
---|---|
FULL | Reclaims space and re-sorts the table (default) |
SORT ONLY | Sorts the table without reclaiming space |
DELETE ONLY | Removes dead rows without sorting |
REINDEX | Rebuilds interleaved sort keys for performance |
Example
VACUUM FULL my_table;
🔹 Best Practice: Run VACUUM during low-traffic periods to avoid performance impact.
ANALYZE (Statistics Update)
🔹 Purpose: Updates metadata about table distribution and statistics, helping the query planner make better decisions.
🔹 Why?: Without fresh statistics, queries might execute inefficiently.
Syntax
ANALYZE [table_name];
Example
ANALYZE my_table;
🔹 Best Practice: Run ANALYZE after INSERT, UPDATE, DELETE, or VACUUM.
EXPLAIN (Query Execution Plan)
🔹 Purpose: Shows the execution plan without running the query, helping diagnose performance issues.
🔹 Why?: Helps identify bottlenecks like high data movement, inefficient joins, or missing sort keys.
Syntax
EXPLAIN query;
Example
EXPLAIN SELECT * FROM my_table WHERE id = 100;
🔹 Output Sample:
XN Seq Scan on my_table (cost=0.00..431.00 rows=10000 width=4)
- Seq Scan → Full table scan (may need indexes/sort keys).
- Cost → Estimated query execution cost.
Summary
Command | Purpose | When to Use? |
---|---|---|
VACUUM | Removes deleted rows & sorts data | After large DELETE or UPDATE operations |
ANALYZE | Updates table statistics | After INSERT, UPDATE, DELETE |
EXPLAIN | Shows query execution plan | Before running a slow query |
Best Practice: Automate VACUUM and ANALYZE using AWS Redshift maintenance scripts!