Evaluate a sample Amazon Redshift data sharing architecture using Redshift Test Drive and advanced SQL analysis

With the launch of Amazon Redshift Serverless and the various provisioned instance deployment options, customers are looking for tools that help them determine the most optimal data warehouse configuration to support their Amazon Redshift workloads.

Amazon Redshift is a widely used, fully managed, petabyte-scale data warehouse service. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads.

Redshift Test Drive is a tool hosted on GitHub that lets customers evaluate which data warehouse configuration options are best suited for their workload. The Test Drive Workload Replicator utility consists of scripts that can be used to extract the workload queries from your source warehouse audit logs and replay them on a target warehouse you launched. The Test Drive Configuration Comparison utility automates this process by deploying target Amazon Redshift warehouses and orchestrating the replay of the source workload through a combination of AWS CloudFormation and AWS Step Functions.

Both utilities unload the performance metrics from the replay of the source workload on the target configuration(s) to Amazon Simple Storage Service (Amazon S3). Although the Replay Analysis UI and the Configuration Comparison utility can provide a preliminary performance comparison, many customers want to dig deeper by analyzing the raw data themselves.

This walkthrough illustrates a sample workload replayed on a single Amazon Redshift data warehouse and on a data sharing architecture using the Workload Replicator utility, the output of which can be used to evaluate the performance of the workload.

Use case overview

For the sample use case, we assume we have an existing 2 x ra3.4xlarge provisioned data warehouse that currently runs extract, transform, and load (ETL), ad hoc, and business intelligence (BI) queries. We are interested in breaking these workloads apart using data sharing into a 32 base Redshift Processing Unit (RPU) Serverless producer running the ETL and a 64 base RPU Serverless consumer running the BI workload. We used Workload Replicator to replay the workload on a replica baseline of the source and on the target data sharing configuration, as specified in the tutorial. The following image shows the process flow.

Generating and accessing Test Drive metrics

The results of Amazon Redshift Test Drive can be accessed using an external schema for analysis of a replay. Refer to the Workload Replicator README and the Configuration Comparison README for more detailed instructions on executing a replay with the respective tool.

The external schema for analysis is automatically created with the Configuration Comparison utility, in which case you can proceed directly to the SQL analysis in the Deploy the QEv2 SQL Notebook and analyze workload section. If you use Workload Replicator, however, the external schema is not created automatically and therefore needs to be configured as a prerequisite to the SQL analysis. In the following walkthrough, we demonstrate how the external schema can be set up, using a sample analysis of the data sharing use case.

Executing Test Drive Workload Replicator for data sharing

To execute Workload Replicator, use Amazon Elastic Compute Cloud (Amazon EC2) to run the automation scripts used to extract the workload from the source.

Configure the Amazon Redshift data warehouse

  1. Create a snapshot following the guidance in the Amazon Redshift Management Guide.
  2. Enable audit logging following the guidance in the Amazon Redshift Management Guide.
  3. Enable user activity logging on the source cluster following the guidance in the Amazon Redshift Management Guide.

Enabling logging requires a change to the parameter group. Audit logging needs to be enabled before the workload that will be replayed runs, because this is where the connections and SQL queries of the workload will be extracted from.

  4. Launch the baseline replica by restoring a 2-node ra3.4xlarge provisioned cluster from the snapshot.
  5. Launch the producer warehouse by restoring the snapshot to a 32 RPU Serverless namespace.
  6. The consumer should not contain the schema and tables that will be shared from the producer. You can either launch the 64 RPU Serverless consumer from the snapshot and then drop the relevant objects, or create a new 64 RPU Serverless consumer warehouse and recreate the consumer users.
  7. Create a datashare from the producer to the consumer and add the relevant objects (see the sketch after this list).
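The datashare itself is created with standard Amazon Redshift data sharing SQL. The following is a minimal sketch under the assumption that a single schema is shared; the datashare name (etl_share), the namespace IDs, and schema_name are placeholders, and the consumer database name matches the CREATE EXTERNAL SCHEMA example that follows.

-- On the producer: create the datashare and add the schema and its tables
-- (etl_share and schema_name are placeholders for your own objects).
CREATE DATASHARE etl_share;
ALTER DATASHARE etl_share ADD SCHEMA schema_name;
ALTER DATASHARE etl_share ADD ALL TABLES IN SCHEMA schema_name;

-- Grant the consumer namespace access to the datashare.
GRANT USAGE ON DATASHARE etl_share TO NAMESPACE '<consumer-namespace-id>';

-- On the consumer: create a local database from the datashare.
CREATE DATABASE datashare_database_name
FROM DATASHARE etl_share OF NAMESPACE '<producer-namespace-id>';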

Datashare objects can be read using two mechanisms: three-part notation (database.schema.table), or an external schema pointing to a shared schema that is queried using two-part notation (external_schema.table). Because we want to seamlessly run the source workload, which uses two-part notation on the local objects, this post demonstrates the latter approach. For each schema shared from the producer, run the following command on the consumer:

CREATE EXTERNAL SCHEMA schema_name
FROM REDSHIFT DATABASE 'datashare_database_name' SCHEMA 'schema_name';

Make sure to use the same schema name as the source for the external schema. Also, if any queries run against the public schema, drop the local public schema before creating its external equivalent.

  8. Grant usage on the schema to any relevant users, as shown in the following sketch.
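For example, if the source workload also uses the public schema, the consumer-side setup could look like the following sketch (bi_user is a hypothetical BI user; add CASCADE to the DROP if local objects still exist):

-- On the consumer: replace the local public schema with an external schema
-- over the shared one (bi_user is a placeholder user name).
DROP SCHEMA IF EXISTS public;
CREATE EXTERNAL SCHEMA public
FROM REDSHIFT DATABASE 'datashare_database_name' SCHEMA 'public';

-- Grant usage to the relevant workload users.
GRANT USAGE ON SCHEMA public TO bi_user;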

Configure Redshift Test Drive Workload Replicator

  1. Create an S3 bucket to store the artifacts required by the utility (such as the metrics, extracted workload, and output data from running UNLOAD commands).
  2. Launch the following three types of EC2 instances using the recommended configuration of m5.8xlarge, 32 GB of SSD storage, and the Amazon Linux AMI:
    a. Baseline instance
    b. Target-producer instance
    c. Target-consumer instance

Make sure you can connect to the EC2 instances to run the utility.

  3. For each instance, install the required libraries by completing the following steps from the GitHub repository:
    a. 2.i
    b. 2.ii (if an ODBC driver should be used; the default is the Amazon Redshift Python driver)
    c. 2.iii
    d. 2.iv
    e. 2.v
  4. Create an AWS Identity and Access Management (IAM) role for the EC2 instances to access the Amazon Redshift warehouses, to read from the S3 audit logging bucket, and with both read and write access to the new S3 bucket created for storing replay artifacts.
  5. If you will run COPY and UNLOAD commands, create an IAM role with access to the required S3 buckets, and attach it to the Amazon Redshift warehouses that will execute the loads and unloads.

In this example, the IAM role is attached to the baseline replica and producer warehouses because they will be executing the ETL processes. The utility updates UNLOAD commands to unload data to a bucket you define, which as a best practice should be the bucket created for the S3 artifacts. Write permissions must be granted to the Amazon Redshift warehouse for this location.

Run Redshift Test Drive Workload Replicator

  1. Run aws configure on the EC2 instances and populate the default Region with the Region the utility is being executed in.
  2. The extract only needs to be run once, so connect to the baseline EC2 instance and run vi config/extract.yaml to open the extract.yaml file and configure the extraction details (press i to begin editing, then press Escape to leave edit mode and :wq! to save and exit vi). For more details on the parameters, see Configure parameters.

The following is an example of a configured extract that unloads the logs for a half-hour window to the Test Drive artifacts bucket and updates COPY commands to run with the POC Amazon Redshift role.

Configuration Extract File

  3. Run make extract to extract the workload. When it completes, make note of the folder created at the path specified for the workload_location parameter in the extract (s3://testdriveartifacts/myworkload/Extraction_xxxx-xx-xxTxx:xx:xx.xxxxxx+00:00).
  4. On the same baseline EC2 instance that will run the full workload on the source replica, run vi config/replay.yaml and configure the details with the workload location copied in the previous step 3 and the baseline warehouse endpoint. (See more details on the parameters in Configure parameters to run an extract job. The values after the analysis_iam_role parameter can be left as the default.)

The following is an example of the beginning of a replay configuration for the source replica.

Config Replay File

  5. On the EC2 instance that will run the target-producer workload, run vi config/replay.yaml. Configure the details with the workload location copied in the previous step 3, the producer warehouse endpoint, and the other configuration as in step 4. To replay only the producer workload, add the appropriate users to include or exclude for the filters parameter.

The following is an example of the filters used to exclude the BI workload from the producer.

Producer Configuration

  6. On the EC2 instance that will run the target-consumer workload, run vi config/replay.yaml and configure the details with the workload location copied in the previous step 3, the consumer warehouse endpoint, and the appropriate filters as in step 5. The same users that were excluded in the producer workload replay should be included in the consumer workload replay.

The following is an example of the filters used to run only the BI workload on the consumer.

Consumer Configuration

  7. Run make replay on the baseline instance, target-producer instance, and target-consumer instance concurrently to run the workload on the target warehouses.

Analyze the Workload Replicator output

  1. Create the following folder structure in the S3 bucket that was created earlier.
'{comparison_stats_s3_path}/{what_if_timestamp}/{cluster_identifier}/'

For comparison_stats_s3_path, enter the S3 bucket and path name. For what_if_timestamp, enter the replay start time. For cluster_identifier, enter the target cluster name for easy identification.

The following screenshot shows an example of the S3 folder configuration.

Configuration of S3

  2. Use the following script to unload the system table data for each target cluster to the corresponding Amazon S3 target path created previously, using QEv2.
UNLOAD ($$
SELECT a.*, TRIM(u.usename) AS username
FROM sys_query_history a, pg_user u
WHERE a.user_id = u.usesysid
AND a.start_time > to_timestamp('{what_if_timestamp}','YYYY-MM-DD-HH24-MI-SS')
$$) TO '{comparison_stats_s3_path}/{what_if_timestamp}/{cluster_identifier}/'
FORMAT AS PARQUET PARALLEL OFF ALLOWOVERWRITE
IAM_ROLE '{redshift_iam_role}';

For what_if_timestamp, enter the replay start time. For comparison_stats_s3_path, enter the S3 bucket and path name. For cluster_identifier, enter the target cluster name for easy identification. For redshift_iam_role, enter the Amazon Resource Name (ARN) of the Redshift IAM role for the target cluster.

Unload File
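As an illustration, the same UNLOAD with the placeholders filled in for the baseline replica might look like the following sketch (the bucket, timestamp, and cluster identifier reuse the sample values from the partition examples later in this post; the IAM role ARN is a placeholder):

-- Illustrative only: the S3 path and timestamp reuse sample values from this
-- post, and the IAM role ARN is a placeholder for your own role.
UNLOAD ($$
SELECT a.*, TRIM(u.usename) AS username
FROM sys_query_history a, pg_user u
WHERE a.user_id = u.usesysid
AND a.start_time > to_timestamp('2024-03-05-21-00-00','YYYY-MM-DD-HH24-MI-SS')
$$) TO 's3://workload-replicator-redshift/2024-03-05-21-00-00/baseline-ra3-4xlarge-2/'
FORMAT AS PARQUET PARALLEL OFF ALLOWOVERWRITE
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-poc-role';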

  3. Create an external schema in Amazon Redshift with the name comparison_stats.
CREATE EXTERNAL SCHEMA comparison_stats from DATA CATALOG
DATABASE 'redshift_config_comparison'
IAM_ROLE '{redshift-iam-role}'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

  4. Create an external table in Amazon Redshift with the name redshift_config_comparision_aggregate based on the Amazon S3 file location.
CREATE EXTERNAL TABLE comparison_stats.redshift_config_comparision_aggregate 
(user_id int,
query_id bigint,
query_label VARCHAR,
transaction_id bigint,
session_id int,
database_name VARCHAR,
query_type VARCHAR,
status VARCHAR,
result_cache_hit boolean,
start_time timestamp,
end_time timestamp,
elapsed_time bigint,
queue_time bigint,
execution_time bigint,
error_message VARCHAR,
returned_rows bigint,
returned_bytes bigint,
query_text VARCHAR,
redshift_version VARCHAR,
usage_limit VARCHAR,
compute_type VARCHAR,
compile_time bigint,
planning_time bigint,
lock_wait_time bigint,
username VARCHAR)
PARTITIONED BY (cluster_identifier VARCHAR)
STORED AS PARQUET
LOCATION '{comparison_stats_s3_path}/{what_if_timestamp}';

  5. After creating the partitioned table, alter the table using the following statement to register the partitions in the external catalog.

When you add a partition, you define the location of the subfolder on Amazon S3 that contains the partition data. Run the statement once for each cluster identifier.

ALTER TABLE comparison_stats.redshift_config_comparision_aggregate
ADD PARTITION (cluster_identifier='{cluster_identifier}')
LOCATION '{comparison_stats_s3_path}/{what_if_timestamp}/{cluster_identifier}/';

Example:

ALTER TABLE comparison_stats.redshift_config_comparision_aggregate
ADD PARTITION (cluster_identifier='baseline-ra3-4xlarge-2')
LOCATION 's3://workload-replicator-redshift/2024-03-05-21-00-00/baseline-ra3-4xlarge-2/';

ALTER TABLE comparison_stats.redshift_config_comparision_aggregate
ADD PARTITION (cluster_identifier='producer-serverless32RPU')
LOCATION 's3://workload-replicator-redshift/2024-03-05-21-00-00/producer-serverless32RPU/';

ALTER TABLE comparison_stats.redshift_config_comparision_aggregate
ADD PARTITION (cluster_identifier='consumer-serverless64RPU')
LOCATION 's3://workload-replicator-redshift/2024-03-05-21-00-00/consumer-serverless64RPU/';

Deploy the QEv2 SQL Notebook and analyze the workload

In this section, we analyze the queries that were replayed on both the baseline and target clusters. The analysis is based on the queries that are common to the baseline and target clusters.

  1. Download the analysis notebook from Amazon S3.
  2. Import the notebook into the baseline Redshift cluster using QEv2. For guidance, refer to Authoring and running notebooks.
  3. Create the stored procedure common_queries_sp in the same database that was used to create the external schema.
  4. The stored procedure creates a view called common_queries by querying the external table redshift_config_comparision_aggregate that was created in the previous steps.

The view identifies the queries that are common to both the baseline and target clusters, as described in the notebook.
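The actual procedure is provided with the notebook; the following is only a rough sketch of how such a procedure could be structured, under the assumption that common queries are matched on query text and query type (the view must be late binding because it references an external table).

-- Illustrative sketch only; the real common_queries_sp ships with the notebook.
CREATE OR REPLACE PROCEDURE common_queries_sp(baseline_id VARCHAR, target_id VARCHAR)
AS $$
BEGIN
  -- Build the view dynamically so the two cluster identifiers can be injected;
  -- WITH NO SCHEMA BINDING is required for views over external tables.
  EXECUTE 'CREATE OR REPLACE VIEW common_queries AS
    SELECT b.query_text,
           b.query_type,
           b.username,
           b.elapsed_time   AS baseline_elapsed_time,
           t.elapsed_time   AS target_elapsed_time,
           b.execution_time AS baseline_execution_time,
           t.execution_time AS target_execution_time,
           b.compile_time   AS baseline_compile_time,
           t.compile_time   AS target_compile_time
    FROM comparison_stats.redshift_config_comparision_aggregate b
    JOIN comparison_stats.redshift_config_comparision_aggregate t
      ON b.query_text = t.query_text
     AND b.query_type = t.query_type
    WHERE b.cluster_identifier = ' || quote_literal(baseline_id) || '
      AND t.cluster_identifier = ' || quote_literal(target_id) || '
    WITH NO SCHEMA BINDING';
END;
$$ LANGUAGE plpgsql;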

  5. Execute the stored procedure by passing the cluster identifiers for the baseline and target clusters as parameters.

For this post, we passed the baseline and producer cluster identifiers as the parameters. Passing the cluster identifiers as parameters retrieves the data only for those specific clusters.
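For instance, using the sample cluster identifiers from the partition examples earlier, the call for the baseline versus producer comparison might look like the following (the exact parameter order is defined by the notebook's procedure).

-- Compare the baseline replica against the producer (sample identifiers).
CALL common_queries_sp('baseline-ra3-4xlarge-2', 'producer-serverless32RPU');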

Once the common_queries view is created, you can perform further analysis using the subsequent queries that are available in the notebook. If you have more than one target cluster, you can follow the same analysis process for each one. For this post, we have two target clusters: producer and consumer. We first performed the analysis between the baseline and producer clusters, then repeated the same process for the baseline versus the consumer cluster.

To analyze our workload, we use the sys_query_history view. We frequently use several columns from this view, including the following:

  • elapsed_time: The end-to-end time of the query run
  • execution_time: The time the query spent running. For a SELECT query, this also includes the return time.
  • compile_time: The time the query spent compiling

For more information on sys_query_history, refer to SYS_QUERY_HISTORY in the Amazon Redshift Database Developer Guide. The following list describes the analysis queries in the notebook.

  1. Overall workload by user: Count of common queries between the baseline and target clusters, grouped by user
  2. Overall workload by query type: Count of common queries between the baseline and target clusters, grouped by query type
  3. Overall workload comparison (in seconds): Compares the overall workload between the baseline and target clusters by analyzing the execution time, compile time, and elapsed time
  4. Percentile workload comparison: The percentage of queries that perform at or below a given runtime (for example, a p50_s value of 5 seconds means 50% of queries in that workload were 5 seconds or faster)
  5. Number of improved/degraded/unchanged queries: The number of queries that degraded, stayed the same, or improved when comparing the elapsed time between the baseline and target clusters
  6. Degree of query-level performance change (percentage): The degree of change of each query from baseline to target, relative to the baseline performance
  7. Comparison by query type (in seconds): Compares the elapsed time of different query types, such as SELECT, INSERT, and COPY commands, between the baseline and target clusters
  8. Top 10 slowest running queries (in seconds): The top 10 slowest queries between the baseline and target clusters, comparing the elapsed time on both clusters
  9. Top 10 improved queries (in seconds): The top 10 queries with the most improved elapsed time when comparing the baseline cluster to the target cluster
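As a rough illustration of the kind of SQL behind these entries (the notebook contains the actual queries), an overall workload comparison similar to query 3 could be written against the common_queries view sketched earlier, assuming its timing columns are in microseconds as in sys_query_history.

-- Sketch of an overall workload comparison (in seconds), assuming the column
-- names from the common_queries sketch above.
SELECT ROUND(SUM(baseline_elapsed_time)   / 1000000.0, 2) AS baseline_elapsed_s,
       ROUND(SUM(target_elapsed_time)     / 1000000.0, 2) AS target_elapsed_s,
       ROUND(SUM(baseline_execution_time) / 1000000.0, 2) AS baseline_execution_s,
       ROUND(SUM(target_execution_time)   / 1000000.0, 2) AS target_execution_s,
       ROUND(SUM(baseline_compile_time)   / 1000000.0, 2) AS baseline_compile_s,
       ROUND(SUM(target_compile_time)     / 1000000.0, 2) AS target_compile_s
FROM common_queries;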

Sample results analysis

In our example, the overall workload improvement of the workload isolation architecture using data sharing for the ETL workload between the baseline and producer is 858 seconds (baseline_elapsed_time - target_elapsed_time) for the sample TPC data, as shown in the following screenshots.

Baseline vs Producer Analysis

Baseline vs Producer Analysis Graph

The overall workload improvement of the workload isolation architecture using data sharing for the BI workload between the baseline and consumer is 1,148 seconds (baseline_elapsed_time - target_elapsed_time) for the sample TPC data, as shown in the following screenshots.

Baseline vs Consumer Analysis

Baseline vs Consumer Analysis Graph

Cleanup

Complete the following steps to clean up your resources:

  1. Delete the Redshift provisioned replica cluster and the two Redshift Serverless endpoints (32 RPU and 64 RPU).
  2. Delete the S3 bucket used to store the artifacts.
  3. Delete the baseline, target-producer, and target-consumer EC2 instances.
  4. Delete the IAM role created for the EC2 instances to access the Redshift clusters and S3 buckets.
  5. Delete the IAM roles created for the Amazon Redshift warehouses to access S3 buckets for COPY and UNLOAD commands.

Conclusion

In this post, we walked you through the process of testing a workload isolation architecture using Amazon Redshift data sharing and the Test Drive utility. We demonstrated how you can use SQL for advanced price-performance analysis and compare different workloads on different target Redshift configurations. We encourage you to evaluate your own Amazon Redshift data sharing architecture using the Redshift Test Drive tool, and to use the provided SQL scripts to analyze the price-performance of your Amazon Redshift cluster.


About the Authors

Ayan Majumder is an Analytics Specialist Solutions Architect at AWS. His expertise lies in designing robust, scalable, and efficient cloud solutions for customers. Beyond his professional life, he enjoys traveling, photography, and outdoor activities.

Ekta Ahuja is an Amazon Redshift Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys landscape photography, traveling, and board games.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She is passionate about supporting customers in validating and optimizing analytics solutions by architecting proof of concept workloads designed to meet their specific needs.
