Amazon Redshift data ingestion options

Amazon Redshift, a warehousing service, offers a variety of options for ingesting data from diverse sources into its high-performance, scalable environment. Whether your data resides in operational databases, data lakes, on-premises systems, Amazon Elastic Compute Cloud (Amazon EC2), or other AWS services, Amazon Redshift provides multiple ingestion methods to meet your specific needs. The currently available choices include:

  • Amazon Redshift COPY command
  • Amazon Redshift federated queries
  • Amazon Redshift Zero-ETL integration
  • Amazon Redshift integration for Apache Spark
  • Amazon Redshift streaming ingestion

This post explores each option (as illustrated in the following figure), determines which are suitable for different use cases, and discusses how and why to select a specific Amazon Redshift tool or feature for data ingestion.

A box for Amazon Redshift sits at the center of the figure, with boxes for Amazon RDS for MySQL and PostgreSQL, Amazon Aurora MySQL-Compatible and PostgreSQL-Compatible, Amazon EMR, AWS Glue, an Amazon S3 bucket, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon Kinesis arranged around it. Each box has an arrow pointing to Amazon Redshift, labeled as follows: Amazon RDS and Amazon Aurora use zero-ETL and federated queries; AWS Glue and Amazon EMR use the Spark connector; the Amazon S3 bucket uses the COPY command; Amazon MSK and Amazon Kinesis use Redshift streaming ingestion. Amazon Data Firehose has an arrow pointing to the Amazon S3 bucket, indicating the direction of data flow.

Amazon Redshift COPY command

The Redshift COPY command, a simple low-code data ingestion tool, loads data into Amazon Redshift from Amazon S3, DynamoDB, Amazon EMR, and remote hosts over SSH. It's a fast and efficient way to load large datasets into Amazon Redshift. It uses the massively parallel processing (MPP) architecture in Amazon Redshift to read and load large amounts of data in parallel from files or data from supported data sources. This allows you to take advantage of parallel processing by splitting data into multiple files, especially when the files are compressed.

Recommended use cases for the COPY command include loading large datasets and data from supported data sources. COPY automatically splits large uncompressed delimited text files into smaller scan ranges to take advantage of the parallelism of Amazon Redshift provisioned clusters and serverless workgroups. With auto-copy, automation enhances the COPY command by adding jobs for automatic ingestion of data.
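
With auto-copy, the COPY statement can be extended with a JOB CREATE clause so that new files arriving at an S3 prefix are ingested automatically. The following is a minimal sketch under assumed placeholder names for the table, bucket, prefix, and job:

COPY myschema.app_events
FROM 's3://my-bucket/app-events/'
IAM_ROLE 'arn:aws:iam::1234567891011:role/MyRedshiftRole'
FORMAT AS CSV
JOB CREATE app_events_ingest_job
AUTO ON;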

COPY command benefits:

  • Performance – Efficiently loads large datasets from Amazon S3 or other sources in parallel with optimized throughput
  • Simplicity – Straightforward and user-friendly, requiring minimal setup
  • Cost-optimized – Uses Amazon Redshift MPP at a lower cost by reducing data transfer time
  • Flexibility – Supports file formats such as CSV, JSON, Parquet, ORC, and AVRO

Amazon Redshift federated queries

Amazon Redshift federated queries allow you to incorporate live data from Amazon RDS or Aurora operational databases as part of business intelligence (BI) and reporting applications.

Federated queries are useful for use cases where organizations want to combine data from their operational systems with data stored in Amazon Redshift. Federated queries allow querying data across Amazon RDS for MySQL and PostgreSQL data sources without the need for extract, transform, and load (ETL) pipelines. If storing operational data in a data warehouse is a requirement, synchronization of tables between operational data stores and Amazon Redshift tables is supported. In scenarios where data transformation is required, you can use Redshift stored procedures to modify data in Redshift tables.
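
For example, the following is a minimal sketch of a stored procedure that copies recent rows from a federated table into a local Redshift table for further transformation; the schema, table, and column names (analytics.orders_local, postgres_schema.orders) are hypothetical:

CREATE OR REPLACE PROCEDURE sync_recent_orders()
AS $$
BEGIN
    -- postgres_schema.orders is a live federated table; analytics.orders_local is a local Redshift table
    INSERT INTO analytics.orders_local
    SELECT order_id, order_date, customer_name, total_amount
    FROM postgres_schema.orders
    WHERE order_date > (SELECT COALESCE(MAX(order_date), '1900-01-01') FROM analytics.orders_local);
END;
$$ LANGUAGE plpgsql;

CALL sync_recent_orders();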

Federated queries key features:

  • Real-time access – Enables querying of live data across discrete sources, such as Amazon RDS and Aurora, without the need to move the data
  • Unified data view – Provides a single view of data across multiple databases, simplifying data analysis and reporting
  • Cost savings – Eliminates the need for ETL processes to move data into Amazon Redshift, saving on storage and compute costs
  • Flexibility – Supports Amazon RDS and Aurora data sources, offering flexibility in accessing and analyzing distributed data

Amazon Redshift Zero-ETL integration

Aurora zero-ETL integration with Amazon Redshift enables access to operational data from Amazon Aurora MySQL-Compatible Edition (and Amazon Aurora PostgreSQL-Compatible Edition and Amazon RDS for MySQL in preview) and DynamoDB from Amazon Redshift without the need for ETL, in near real time. You can use zero-ETL to simplify ingestion pipelines for performing change data capture (CDC) from an Aurora database to Amazon Redshift. Built on the integration of the Amazon Redshift and Aurora storage layers, zero-ETL offers straightforward setup, data filtering, automated observability, auto-recovery, and integration with either Amazon Redshift provisioned clusters or Amazon Redshift Serverless workgroups.
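
After the integration is active, the replicated data is exposed in Amazon Redshift by creating a database from the integration. A minimal sketch, assuming a placeholder integration ID and database name:

CREATE DATABASE aurora_zeroetl
FROM INTEGRATION '<integration-id>';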

Zero-ETL integration advantages:

  • Seamless integration – Automatically integrates and synchronizes data between operational databases and Amazon Redshift without the need for custom ETL processes
  • Near real-time insights – Provides near real-time data updates, so the most current data is available for analysis
  • Ease of use – Simplifies data architecture by eliminating the need for separate ETL tools and processes
  • Efficiency – Minimizes data latency and provides data consistency across systems, enhancing overall data accuracy and reliability

Amazon Redshift integration for Apache Spark

The Amazon Redshift integration for Apache Spark, automatically included with Amazon EMR or AWS Glue, provides performance and security optimizations when compared to the community-provided connector. The integration enhances and simplifies security with AWS Identity and Access Management (IAM) authentication support. AWS Glue 4.0 provides a visual ETL tool for authoring jobs to read from and write to Amazon Redshift, using the Redshift Spark connector for connectivity. This simplifies the process of building ETL pipelines to Amazon Redshift. The Spark connector allows use of Spark applications to process and transform data before loading into Amazon Redshift. The integration minimizes the manual process of setting up a Spark connector and shortens the time needed to prepare for analytics and machine learning (ML) tasks. It allows you to specify the connection to a data warehouse and start working with Amazon Redshift data from your Apache Spark-based applications within minutes.

The integration provides pushdown capabilities for sort, aggregate, limit, join, and scalar function operations to optimize performance by moving only the relevant data from Amazon Redshift to the consuming Apache Spark application. Spark jobs are suitable for data processing pipelines and when you need to use Spark's advanced data transformation capabilities.

With the Amazon Redshift integration for Apache Spark, you can simplify the building of ETL pipelines with data transformation requirements. It offers the following benefits:

  • High performance – Uses the distributed computing power of Apache Spark for large-scale data processing and analysis
  • Scalability – Effortlessly scales to handle massive datasets by distributing computation across multiple nodes
  • Flexibility – Supports a wide range of data sources and formats, providing versatility in data processing tasks
  • Interoperability – Seamlessly integrates with Amazon Redshift for efficient data transfer and queries

Amazon Redshift streaming ingestion

The key benefit of Amazon Redshift streaming ingestion is the ability to ingest hundreds of megabytes of data per second directly from streaming sources into Amazon Redshift with very low latency, supporting real-time analytics and insights. Supporting streams from Kinesis Data Streams, Amazon MSK, and Data Firehose, streaming ingestion requires no data staging, supports flexible schemas, and is configured with SQL. Streaming ingestion powers real-time dashboards and operational analytics by directly ingesting data into Amazon Redshift materialized views.
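
As an illustration of how little SQL is involved, the following sketch maps a Kinesis data stream into a materialized view; the schema name, stream name, and IAM role ARN are placeholders:

CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRoleForKinesis';

CREATE MATERIALIZED VIEW clickstream_view
AUTO REFRESH YES
AS SELECT
    approximate_arrival_timestamp,
    partition_key,
    shard_id,
    sequence_number,
    JSON_PARSE(kinesis_data) AS payload
FROM kinesis_schema."my-click-stream";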

Amazon Redshift streaming ingestion unlocks near real-time streaming analytics with:

  • Low latency – Ingests streaming data in near real time, making streaming ingestion ideal for time-sensitive applications such as Internet of Things (IoT), financial transactions, and clickstream analysis
  • Scalability – Manages high throughput and large volumes of streaming data from sources such as Kinesis Data Streams, Amazon MSK, and Data Firehose
  • Integration – Integrates with other AWS services to build end-to-end streaming data pipelines
  • Continuous updates – Keeps data in Amazon Redshift continuously updated with the latest information from the data streams

Amazon Redshift ingestion use cases and examples

In this section, we discuss the details of different Amazon Redshift ingestion use cases and provide examples.

Redshift COPY use case: Application log data ingestion and analysis

Ingesting application log data stored in Amazon S3 is a common use case for the Redshift COPY command. Data engineers in an organization need to analyze application log data to gain insights into user behavior, identify potential issues, and optimize a platform's performance. To achieve this, data engineers ingest log data in parallel from multiple files stored in S3 buckets into Redshift tables. This parallelization uses the Amazon Redshift MPP architecture, allowing for faster data ingestion compared to other ingestion methods.

The following code is an example of the COPY command loading data from a set of CSV files in an S3 bucket into a Redshift table:

COPY myschema.mytable
FROM 's3://my-bucket/data/files/'
IAM_ROLE 'arn:aws:iam::1234567891011:role/MyRedshiftRole'
FORMAT AS CSV;

This code uses the following parameters:

  • mytable is the target Redshift table for the data load
  • 's3://my-bucket/data/files/' is the S3 path where the CSV files are located
  • IAM_ROLE specifies the IAM role required to access the S3 bucket
  • FORMAT AS CSV specifies that the data files are in CSV format

In addition to Amazon S3, the COPY command loads data from other sources, such as DynamoDB, Amazon EMR, remote hosts over SSH, or other Redshift databases. The COPY command provides options to specify data formats, delimiters, compression, and other parameters to handle different data sources and formats.
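
As an illustration, the following variation of the earlier command (using the same placeholder bucket and role) loads pipe-delimited, gzip-compressed files while skipping a header row:

COPY myschema.mytable
FROM 's3://my-bucket/data/files/'
IAM_ROLE 'arn:aws:iam::1234567891011:role/MyRedshiftRole'
DELIMITER '|'
GZIP
IGNOREHEADER 1
REGION 'us-east-1';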

To get started with the COPY command, see Using the COPY command to load from Amazon S3.

Federated queries use case: Integrated reporting and analytics for a retail company

For this use case, a retail company has an operational database running on Amazon RDS for PostgreSQL, which stores real-time sales transactions, inventory levels, and customer information data. Additionally, a data warehouse runs on Amazon Redshift storing historical data for reporting and analytics purposes. To create an integrated reporting solution that combines real-time operational data with historical data in the data warehouse, without the need for multi-step ETL processes, complete the following steps:

  1. Set up network connectivity. Make sure your Redshift cluster and RDS for PostgreSQL instance are in the same virtual private cloud (VPC) or have network connectivity established through VPC peering, AWS PrivateLink, or AWS Transit Gateway.
  2. Create a secret and IAM role for federated queries:
    1. In AWS Secrets Manager, create a new secret to store the credentials (user name and password) for your Amazon RDS for PostgreSQL instance.
    2. Create an IAM role with permissions to access the Secrets Manager secret and the Amazon RDS for PostgreSQL instance.
    3. Associate the IAM role with your Amazon Redshift cluster.
  3. Create an external schema in Amazon Redshift:
    1. Connect to your Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
    2. Create an external schema that references your Amazon RDS for PostgreSQL instance:
CREATE EXTERNAL SCHEMA postgres_schema
FROM POSTGRES
DATABASE 'mydatabase'
SCHEMA 'public'
URI 'endpoint-for-your-rds-instance.aws-region.rds.amazonaws.com:5432'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRoleForRDS'
SECRET_ARN 'arn:aws:secretsmanager:aws-region:123456789012:secret:my-rds-secret-abc123';

  4. Query tables in your Amazon RDS for PostgreSQL instance directly from Amazon Redshift using federated queries:
SELECT
    r.order_id,
    r.order_date,
    r.customer_name,
    r.total_amount,
    h.product_name,
    h.class
FROM
    postgres_schema.orders r
    JOIN redshift_schema.product_history h ON r.product_id = h.product_id
WHERE
    r.order_date >= '2024-01-01';

  5. Create views or materialized views in Amazon Redshift that combine the operational data from federated queries with the historical data in Amazon Redshift for reporting purposes:
CREATE MATERIALIZED VIEW sales_report AS
SELECT
    r.order_id,
    r.order_date,
    r.customer_name,
    r.total_amount,
    h.product_name,
    h.category,
    h.historical_sales
FROM
    (
        SELECT
            order_id,
            order_date,
            customer_name,
            total_amount,
            product_id
        FROM
            postgres_schema.orders
    ) r
    JOIN redshift_schema.product_history h ON r.product_id = h.product_id;

With this implementation, federated queries in Amazon Redshift integrate real-time operational data from Amazon RDS for PostgreSQL instances with historical data in a Redshift data warehouse. This approach eliminates the need for multi-step ETL processes and enables you to create comprehensive reports and analytics that combine data from multiple sources.

To get started with Amazon Redshift federated query ingestion, see Querying data with federated queries in Amazon Redshift.

Zero-ETL integration use case: Near real-time analytics for an ecommerce application

Suppose an ecommerce application built on Aurora MySQL-Compatible manages online orders, customer data, and product catalogs. To perform near real-time analytics with data filtering on transactional data to gain insights into customer behavior, sales trends, and inventory management without the overhead of building and maintaining multi-step ETL pipelines, you can use zero-ETL integrations for Amazon Redshift. Complete the following steps:

  1. Set up an Aurora MySQL cluster (must be running Aurora MySQL version 3.05, which is compatible with MySQL 8.0.32, or higher):
    1. Create an Aurora MySQL cluster in your desired AWS Region.
    2. Configure the cluster settings, such as the instance type, storage, and backup options.
  2. Create a zero-ETL integration with Amazon Redshift:
    1. On the Amazon RDS console, navigate to the Zero-ETL integrations page.
    2. Choose Create integration and select your Aurora MySQL cluster as the source.
    3. Choose an existing Redshift cluster or create a new cluster as the target.
    4. Provide a name for the integration and review the settings.
    5. Choose Create integration to initiate the zero-ETL integration process.
  3. Verify the integration status:
    1. After the integration is created, monitor the status on the Amazon RDS console or by querying the SVV_INTEGRATION and SYS_INTEGRATION_ACTIVITY system views in Amazon Redshift (an example status query follows these steps).
    2. Wait for the integration to reach the Active state, indicating that data is being replicated from Aurora to Amazon Redshift.
  4. Create analytics views:
    1. Connect to your Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
    2. Create views or materialized views that combine and transform the replicated data from Aurora for your analytics use cases:
CREATE MATERIALIZED VIEW orders_summary AS
SELECT
    o.order_id,
    o.customer_id,
    SUM(oi.quantity * oi.price) AS total_revenue,
    MAX(o.order_date) AS latest_order_date
FROM
    aurora_schema.orders o
    JOIN aurora_schema.order_items oi ON o.order_id = oi.order_id
GROUP BY
    o.order_id,
    o.customer_id;

  5. Query the views or materialized views in Amazon Redshift to perform near real-time analytics on the transactional data from your Aurora MySQL cluster:
SELECT
	customer_id,
	SUM(total_revenue) AS total_customer_revenue,
	MAX(latest_order_date) AS most_recent_order
FROM
	orders_summary
GROUP BY
	customer_id
ORDER BY
	total_customer_revenue DESC;
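
The status check referenced in step 3 can be run with a query along the following lines; the column names are assumptions based on the SVV_INTEGRATION system view and may differ slightly:

SELECT integration_id, target_database, state
FROM svv_integration;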

This implementation achieves near real-time analytics for an ecommerce application's transactional data using the zero-ETL integration between Aurora MySQL-Compatible and Amazon Redshift. The data automatically replicates from Aurora to Amazon Redshift, eliminating the need for multi-step ETL pipelines and supporting insights from the latest data quickly.

To get started with Amazon Redshift zero-ETL integrations, see Working with zero-ETL integrations. To learn more about Aurora zero-ETL integrations with Amazon Redshift, see Amazon Aurora zero-ETL integrations with Amazon Redshift.

Integration for Apache Spark use case: Gaming player events written to Amazon S3

Consider a large volume of gaming player events stored in Amazon S3. The events require data transformation, cleansing, and preprocessing to extract insights, generate reports, or build ML models. In this case, you can use the scalability and processing power of Amazon EMR to perform the required data changes using Apache Spark. After it's processed, the transformed data must be loaded into Amazon Redshift for further analysis, reporting, and integration with BI tools.

In this scenario, you can use the Amazon Redshift integration for Apache Spark to perform the necessary data transformations and load the processed data into Amazon Redshift. The following implementation example assumes gaming player events in Parquet format are stored in Amazon S3 (s3://<bucket_name>/player_events/).

  1. Launch an Amazon EMR (emr-6.9.0) cluster with Apache Spark (Spark 3.3.0) and Amazon Redshift integration for Apache Spark support.
  2. Configure the necessary IAM role for accessing Amazon S3 and Amazon Redshift.
  3. Add security group rules to Amazon Redshift to allow access to the provisioned cluster or serverless workgroup.
  4. Create a Spark job that sets up a connection to Amazon Redshift, reads data from Amazon S3, performs transformations, and writes the resulting data to Amazon Redshift. See the following code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

def main():

    # Create a SparkSession
    spark = SparkSession.builder \
        .appName("RedshiftSparkJob") \
        .getOrCreate()

    # Set Amazon Redshift connection properties
    redshift_jdbc_url = "jdbc:redshift://<redshift-endpoint>:<port>/<database>"
    redshift_table = "<schema>.<table_name>"
    temp_s3_bucket = "s3://<bucket_name>/temp/"
    iam_role_arn = "<iam_role_arn>"

    # Read data from Amazon S3
    s3_data = spark.read.format("parquet") \
        .load("s3://<bucket_name>/player_events/")

    # Perform transformations
    transformed_data = s3_data.withColumn("transformed_column", lit("transformed_value"))

    # Write the transformed data to Amazon Redshift
    transformed_data.write \
        .format("io.github.spark_redshift_community.spark.redshift") \
        .option("url", redshift_jdbc_url) \
        .option("dbtable", redshift_table) \
        .option("tempdir", temp_s3_bucket) \
        .option("aws_iam_role", iam_role_arn) \
        .mode("overwrite") \
        .save()

if __name__ == "__main__":
    main()

In this example, you first import the necessary modules and create a SparkSession. Set the connection properties for Amazon Redshift, including the endpoint, port, database, schema, table name, temporary S3 bucket path, and the IAM role ARN for authentication. Read data from Amazon S3 in Parquet format using the spark.read.format("parquet").load() method. Perform a transformation on the Amazon S3 data by adding a new column transformed_column with a constant value using the withColumn method and the lit function. Write the transformed data to Amazon Redshift using the write method and the io.github.spark_redshift_community.spark.redshift format. Set the necessary options for the Redshift connection URL, table name, temporary S3 bucket path, and IAM role ARN. Use the mode("overwrite") option to overwrite the existing data in the Amazon Redshift table with the transformed data.

To get started with Amazon Redshift integration for Apache Spark, see Amazon Redshift integration for Apache Spark. For more examples of using the Amazon Redshift for Apache Spark connector, see New – Amazon Redshift Integration with Apache Spark.

Streaming ingestion use case: IoT telemetry near real-time analysis

Imagine a fleet of IoT devices (sensors and industrial equipment) that generate a continuous stream of telemetry data such as temperature readings, pressure measurements, or operational metrics. Ingesting this data in real time to perform analytics to monitor the devices, detect anomalies, and make data-driven decisions requires a streaming solution integrated with a Redshift data warehouse.

In this example, we use Amazon MSK as the streaming source for IoT telemetry data.

  1. Create an external schema in Amazon Redshift:
    1. Connect to an Amazon Redshift cluster using a SQL client or the query editor v2 on the Amazon Redshift console.
    2. Create an external schema that references the MSK cluster:
CREATE EXTERNAL SCHEMA kafka_schema
FROM KAFKA
BROKER 'broker-1.example.com:9092,broker-2.example.com:9092'
TOPIC 'iot-telemetry-topic'
REGION 'us-east-1'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRoleForMSK';

  2. Create a materialized view in Amazon Redshift:
    1. Define a materialized view that maps the Kafka topic data to Amazon Redshift table columns.
    2. CAST the streaming message payload data type to the Amazon Redshift SUPER type.
    3. Set the materialized view to auto refresh.
CREATE MATERIALIZED VIEW iot_telemetry_view
AUTO REFRESH YES
AS SELECT
    kafka_partition,
    kafka_offset,
    kafka_timestamp_type,
    kafka_timestamp,
    CAST(kafka_value AS SUPER) payload
FROM kafka_schema."iot-telemetry-topic";

  3. Query the iot_telemetry_view materialized view to access the real-time IoT telemetry data ingested from the Kafka topic. The materialized view will automatically refresh as new data arrives in the Kafka topic.
SELECT
    v.kafka_timestamp,
    v.payload.device_id,
    v.payload.temperature,
    v.payload.pressure
FROM iot_telemetry_view v;

With this implementation, you can achieve near real-time analytics on IoT device telemetry data using Amazon Redshift streaming ingestion. As telemetry data is received by an MSK topic, Amazon Redshift automatically ingests and reflects the data in a materialized view, supporting query and analysis of the data in near real time.
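
Building on the materialized view above, a rollup such as the following sketch could drive a near real-time dashboard of average temperature per device over the trailing 5 minutes; the device_id and temperature fields are assumed from the example payload:

SELECT
    CAST(v.payload.device_id AS VARCHAR) AS device_id,
    AVG(CAST(v.payload.temperature AS FLOAT)) AS avg_temperature
FROM iot_telemetry_view v
WHERE v.kafka_timestamp > DATEADD(minute, -5, GETDATE())
GROUP BY 1;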

To get started with Amazon Redshift streaming ingestion, see Streaming ingestion to a materialized view. To learn more about streaming and customer use cases, see Amazon Redshift Streaming Ingestion.

Conclusion

This post detailed the options available for Amazon Redshift data ingestion. The choice of data ingestion method depends on factors such as the size and structure of the data, the need for real-time access or transformations, data sources, existing infrastructure, ease of use, and user skill sets. Zero-ETL integrations and federated queries are suitable for straightforward data ingestion tasks or joining data between operational databases and Amazon Redshift analytics data. Large-scale data ingestion with transformation and orchestration benefits from the Amazon Redshift integration for Apache Spark with Amazon EMR and AWS Glue. Bulk loading of data into Amazon Redshift, regardless of dataset size, fits perfectly with the capabilities of the Redshift COPY command. Streaming sources such as Kinesis Data Streams, Amazon MSK, or Data Firehose are ideal scenarios for using AWS streaming services integration for data ingestion.

Evaluate the features and guidance provided for your data ingestion workloads and let us know your feedback in the comments.


About the Authors

Steve Phillips is a senior technical account manager at AWS in the North America region. Steve has worked with games customers for eight years and currently focuses on data warehouse architectural design, data lakes, data ingestion pipelines, and cloud distributed architectures.

Sudipta Bagchi is a Sr. Specialist Solutions Architect at Amazon Web Services. He has over 14 years of experience in data and analytics, and helps customers design and build scalable and high-performant analytics solutions. Outside of work, he loves running, traveling, and playing cricket.
