With the launch of Amazon Redshift Serverless and the variety of provisioned instance deployment options, customers are looking for tools that help them determine the most optimal data warehouse configuration to support their Amazon Redshift workloads.

Amazon Redshift is a widely used, fully managed, petabyte-scale data warehouse service. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads.
Redshift Test Drive is a tool hosted on GitHub that lets customers evaluate which data warehouse configuration options are best suited for their workload. The Test Drive Workload Replicator utility consists of scripts that can be used to extract the workload queries from your source warehouse audit logs and replay them on a target warehouse you launched. The Test Drive Configuration Comparison utility automates this process by deploying target Amazon Redshift warehouses and orchestrating the replay of the source workload through a combination of AWS CloudFormation and AWS Step Functions.

Both utilities unload the performance metrics from the replay of the source workload on the target configuration(s) to Amazon Simple Storage Service (Amazon S3). Although the Replay Analysis UI and the Configuration Comparison utility can provide a preliminary performance comparison, many customers want to dig deeper by analyzing the raw data themselves.

This walkthrough illustrates an example workload replayed on a single Amazon Redshift data warehouse and on a data sharing architecture using the Workload Replicator utility, the output of which can be used to evaluate the performance of the workload.
Use case overview
For the sample use case, we assume we have an existing 2 x ra3.4xlarge provisioned data warehouse that currently runs extract, transform, and load (ETL), ad hoc, and business intelligence (BI) queries. We're interested in breaking these workloads apart using data sharing into a 32 base Redshift Processing Unit (RPU) Serverless producer running the ETL and a 64 base RPU Serverless consumer running the BI workload. We used Workload Replicator to replay the workload on a baseline replica of the source and on the target data sharing configuration as specified in the tutorial. The following image shows the process flow.
Generating and accessing Test Drive metrics

The results of Amazon Redshift Test Drive can be accessed using an external schema for analysis of a replay. Refer to the Workload Replicator README and the Configuration Comparison README for more detailed instructions on executing a replay with the respective tool.

The external schema for analysis is automatically created with the Configuration Comparison utility, in which case you can proceed directly to the SQL analysis in the Deploy the QEv2 SQL Notebook and analyze workload section. If you use Workload Replicator, however, the external schema is not created automatically and therefore needs to be configured as a prerequisite to the SQL analysis. In the following walkthrough, we demonstrate how the external schema can be set up, using a sample analysis of the data sharing use case.
Executing Test Drive Workload Replicator for data sharing

To execute Workload Replicator, use Amazon Elastic Compute Cloud (Amazon EC2) to run the automation scripts that extract the workload from the source.

Configure the Amazon Redshift data warehouse
- Create a snapshot following the guidance in the Amazon Redshift Management Guide.
- Enable audit logging following the guidance in the Amazon Redshift Management Guide.
- Enable user activity logging on the source cluster following the guidance in the Amazon Redshift Management Guide.

Enabling logging requires a change to the parameter group. Audit logging needs to be enabled before the workload that will be replayed runs, because the connections and SQL queries of the workload are extracted from these logs.
- Launch the baseline replica by restoring a 2-node ra3.4xlarge provisioned cluster from the snapshot.
- Launch the producer warehouse by restoring the snapshot to a 32 RPU Serverless namespace.
- Launch the consumer. The consumer should not contain the schemas and tables that will be shared from the producer, so you can either launch the 64 RPU Serverless consumer from the snapshot and then drop the relevant objects, or create a new 64 RPU Serverless consumer warehouse and recreate the consumer users.
- Create a datashare from the producer to the consumer and add the relevant objects.
Datashare objects can be read using two mechanisms: using three-part notation (database.schema.table), or by creating an external schema pointing to a shared schema and querying it using two-part notation (external_schema.table). Because we want to seamlessly run the source workload, which uses two-part notation on the local objects, this post demonstrates the latter approach. For each schema shared from the producer, run the following command on the consumer:
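The following is a sketch of those commands, assuming a datashare named etl_share, a shared schema named demo_schema, and a placeholder producer namespace ID (all hypothetical names):

```sql
-- Hypothetical names: etl_share, producer_db, demo_schema, and the namespace ID.
-- Run once on the consumer: create a local database from the datashare.
CREATE DATABASE producer_db FROM DATASHARE etl_share
OF NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';

-- For each shared schema, create an external schema with the same name as the
-- source so that two-part notation (schema.table) resolves unchanged.
CREATE EXTERNAL SCHEMA demo_schema
FROM REDSHIFT DATABASE 'producer_db' SCHEMA 'demo_schema';
```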
Make sure to use the same schema name as the source for the external schema. Also, if any queries run against the public schema, drop the local public schema first before creating the external equivalent.

- Grant usage on the schema to any relevant users.
Configure Redshift Test Drive Workload Replicator

- Create an S3 bucket to store the artifacts required by the utility (such as the metrics, the extracted workload, and output data from running UNLOAD commands).
- Launch the following three types of EC2 instances using the recommended configuration of m5.8xlarge, 32 GB of SSD storage, and the Amazon Linux AMI:
  - Baseline instance
  - Target-producer instance
  - Target-consumer instance

Make sure you can connect to the EC2 instances to run the utility.
- For each instance, install the required libraries by completing the following steps from the GitHub repository:
  a. 2.i
  b. 2.ii (if an ODBC driver should be used; the default is the Amazon Redshift Python driver)
  c. 2.iii
  d. 2.iv
  e. 2.v
- Create an AWS Identity and Access Management (IAM) role for the EC2 instances with access to the Amazon Redshift warehouses, read access to the S3 audit logging bucket, and both read and write access to the new S3 bucket created for storing replay artifacts.
- If you will run COPY and UNLOAD commands, create an IAM role with access to the required S3 buckets, and attach it to the Amazon Redshift warehouses that will execute the load and unload.

In this example, the IAM role is attached to the baseline replica and producer warehouses because these will be executing the ETL processes. The utility updates UNLOAD commands to unload data to a bucket you define, which as a best practice should be the bucket created for S3 artifacts. Write permission to this location must be granted to the Amazon Redshift warehouse.
Run Redshift Test Drive Workload Replicator

- Run aws configure on the EC2 instances and populate the default Region with the Region the utility is being executed in.
- The extract only needs to be run once, so connect to the baseline EC2 instance and run vi config/extract.yaml to open the extract.yaml file and configure the extraction details (press i to begin editing, press Escape to leave edit mode, and enter :wq! to save and quit vi). For more details on the parameters, see Configure parameters.
The following code is an example of a configured extract that unloads the logs for a half-hour window to the Test Drive artifacts bucket and updates COPY commands to run with the POC Amazon Redshift role.
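As a rough sketch, such an extract.yaml might look like the following. The endpoint, time window, bucket, and role ARN are hypothetical, and the parameter names should be verified against the Workload Replicator README:

```yaml
# Hypothetical values; verify parameter names against the Workload Replicator README.
source_cluster_endpoint: "source-dw.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev"
master_username: "awsuser"
start_time: "2024-01-15T12:00:00+00:00"
end_time: "2024-01-15T12:30:00+00:00"
workload_location: "s3://testdriveartifacts/myworkload"
source_cluster_system_table_unload_iam_role: "arn:aws:iam::123456789012:role/redshift-poc-role"
```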
- Run make extract to extract the workload. When it completes, make a note of the folder created at the path specified by the workload_location parameter of the extract (s3://testdriveartifacts/myworkload/Extraction_xxxx-xx-xxTxx:xx:xx.xxxxxx+00:00).
- On the same baseline EC2 instance that will run the full workload on the source replica, run vi config/replay.yaml and configure the details with the workload location copied in the previous extract step and the baseline warehouse endpoint. (For more details on the parameters, see Configure parameters to run an extract job. The values after the analysis_iam_role parameter can be left at their defaults.)
The following code is an example of the beginning of a replay configuration for the source replica.
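A sketch of how such a replay configuration might begin (the endpoint, extraction folder, and role ARN are hypothetical placeholders; check the README for the exact keys):

```yaml
# Hypothetical values; verify parameter names against the Workload Replicator README.
workload_location: "s3://testdriveartifacts/myworkload/Extraction_2024-01-15T12:00:00.000000+00:00"
target_cluster_endpoint: "baseline-replica.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev"
target_cluster_region: "us-east-1"
master_username: "awsuser"
analysis_iam_role: "arn:aws:iam::123456789012:role/redshift-poc-role"
```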
- On the EC2 instance that will run the target-producer workload, run vi config/replay.yaml. Configure the details with the same workload location, the producer warehouse endpoint, and the other configuration used for the baseline replay. To replay only the producer workload, add the appropriate users to include or exclude in the filters parameter.
The following code is an example of the filters used to exclude the BI workload from the producer.
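For illustration, assuming the BI queries run as a user named bi_user (a hypothetical name), the producer filters might look like this:

```yaml
# bi_user is a hypothetical username; the include/exclude structure follows the README.
filters:
  include:
    database_name: ['*']
    username: ['*']
    pool_name: ['*']
  exclude:
    username: ['bi_user']
```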
- On the EC2 instance that will run the target-consumer workload, run vi config/replay.yaml and configure the details with the same workload location, the consumer warehouse endpoint, and appropriate filters as for the producer. The same users that were excluded from the producer workload replay should be included in the consumer workload replay.
The following is an example of the filters used to run only the BI workload on the consumer.
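Continuing the hypothetical bi_user example, the consumer replay inverts the filter so that only the BI user's queries run:

```yaml
# Mirror of the producer filters: replay only the hypothetical BI user's queries.
filters:
  include:
    database_name: ['*']
    username: ['bi_user']
    pool_name: ['*']
```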
- Run make replay on the baseline instance, target-producer instance, and target-consumer instance concurrently to run the workload on the target warehouses.

Analyze the Workload Replicator output
- Create the folder structure in the S3 bucket that was created previously.

For comparison_stats_s3_path, enter the S3 bucket and path name. For what_if_timestamp, enter the replay start time. For cluster_identifier, enter the target cluster name for easy identification.

The following screenshot shows the resulting folder structure.
- Use the following script, run in the baseline Redshift cluster using QEv2, to unload the system table data for each target cluster to its corresponding Amazon S3 target path created previously.

For what_if_timestamp, enter the replay start time. For comparison_stats_s3_path, enter the S3 bucket and path name. For cluster_identifier, enter the target cluster name for easy identification. For redshift_iam_role, enter the Amazon Resource Name (ARN) of the Redshift IAM role for the target cluster.
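The unload can be sketched as follows. The timestamp, bucket path, and role ARN are hypothetical placeholders:

```sql
-- Placeholders: replace the timestamp, S3 path, and IAM role ARN with your values.
-- Run once per target warehouse, changing the trailing S3 subfolder to that
-- warehouse's cluster identifier.
UNLOAD ($$
  SELECT *
  FROM sys_query_history
  WHERE start_time > '2024-01-15 12:00:00'
$$)
TO 's3://testdriveartifacts/comparison_stats/baseline-replica/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-poc-role'
FORMAT AS PARQUET
ALLOWOVERWRITE;
```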
- Create an external schema in Amazon Redshift with the name comparison_stats.
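A sketch of the schema creation, assuming a Data Catalog database named comparison_stats_db (hypothetical) and the same IAM role used for the unload:

```sql
-- comparison_stats_db is a hypothetical AWS Glue Data Catalog database name.
CREATE EXTERNAL SCHEMA comparison_stats
FROM DATA CATALOG
DATABASE 'comparison_stats_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-poc-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```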
- Create an external table in Amazon Redshift with the name redshift_config_comparison_aggregate, based on the Amazon S3 file location.
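A sketch of the table definition, partitioned by cluster identifier. The column list below is abbreviated to the sys_query_history columns used later in the analysis; the real definition should mirror the full schema of the unloaded files:

```sql
-- Abbreviated column list for illustration; match it to your unloaded file schema.
CREATE EXTERNAL TABLE comparison_stats.redshift_config_comparison_aggregate (
  user_id        INT,
  query_id       BIGINT,
  query_type     VARCHAR(32),
  query_text     VARCHAR(4000),
  elapsed_time   BIGINT,
  execution_time BIGINT,
  compile_time   BIGINT
)
PARTITIONED BY (cluster_identifier VARCHAR(64))
STORED AS PARQUET
LOCATION 's3://testdriveartifacts/comparison_stats/';
```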
- After creating the partitioned table, alter the table using the following statement to register the partitions in the external catalog.

When you add a partition, you define the location of the subfolder on Amazon S3 that contains the partition data. Run the statement for each cluster identifier.

Example:
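A sketch with hypothetical names (the cluster identifier and S3 path must match your partition folders):

```sql
-- Repeat for each cluster identifier (baseline, producer, consumer), pointing
-- at the matching S3 subfolder.
ALTER TABLE comparison_stats.redshift_config_comparison_aggregate
ADD IF NOT EXISTS PARTITION (cluster_identifier = 'baseline-replica')
LOCATION 's3://testdriveartifacts/comparison_stats/baseline-replica/';
```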
Deploy the QEv2 SQL Notebook and analyze the workload

In this section, we analyze the queries that were replayed in both the baseline and target clusters. The analysis is based on the queries common to both clusters.
- Download the analysis notebook from Amazon S3.
- Import the notebook into the baseline Redshift cluster using QEv2. For guidance, refer to Authoring and running notebooks.
- Create the stored procedure common_queries_sp in the same database that was used to create the external schema. The stored procedure creates a view called common_queries by querying the external table redshift_config_comparison_aggregate that was created in the previous steps.

The view identifies the queries common to both the baseline and target clusters, as described in the notebook.
- Execute the stored procedure, passing the cluster identifiers of the baseline and target clusters as parameters.

For this post, we passed the baseline and producer cluster identifiers as the parameters. Passing the cluster identifiers as parameters retrieves the data for those specific clusters only.
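Assuming the stored procedure takes the two identifiers as text arguments (a hypothetical signature; check the notebook's definition), the call might look like:

```sql
-- Hypothetical cluster identifiers; use the values registered as partitions.
CALL common_queries_sp('baseline-replica', 'target-producer');
```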
After the common_queries view is created, you can perform further analysis using the subsequent queries that are available in the notebook. If you have more than one target cluster, you can follow the same analysis process for each. For this post, we have two target clusters: producer and consumer. We first performed the analysis between the baseline and producer clusters, then repeated the same process to analyze the data for the baseline versus the consumer cluster.
To analyze our workload, we use the sys_query_history view. We frequently use several columns from this view, including the following:

- elapsed_time: The end-to-end time of the query run
- execution_time: The time the query spent running. In the case of a SELECT query, this also includes the return time.
- compile_time: The time the query spent compiling
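For example, a quick aggregate over the external table comparing these timings per cluster might look like the following sketch (the notebook's queries are the authoritative versions; sys_query_history reports timings in microseconds):

```sql
-- Total timings per cluster for the replayed workload, converted to seconds.
SELECT cluster_identifier,
       SUM(elapsed_time)   / 1000000.0 AS total_elapsed_s,
       SUM(execution_time) / 1000000.0 AS total_execution_s,
       SUM(compile_time)   / 1000000.0 AS total_compile_s
FROM comparison_stats.redshift_config_comparison_aggregate
GROUP BY cluster_identifier
ORDER BY cluster_identifier;
```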
For more information on sys_query_history, refer to SYS_QUERY_HISTORY in the Amazon Redshift Database Developer Guide. The following table shows the descriptions of the analysis queries.
| | Name of the query | Description |
|---|---|---|
| 1 | Overall workload by user | Count of the common queries between the baseline and target clusters, grouped by user |
| 2 | Overall workload by query type | Count of the common queries between the baseline and target clusters, grouped by query type |
| 3 | Overall workload comparison (in seconds) | Compares the overall workload between the baseline and target clusters by analyzing the execution time, compile time, and elapsed time |
| 4 | Percentile workload comparison | The percentage of queries that perform at or below a given runtime (for example, p50_s having a value of 5 seconds means 50% of queries in that workload were 5 seconds or faster) |
| 5 | Number of improved/degraded/unchanged queries | The number of queries that improved, degraded, or stayed the same when comparing elapsed time between the baseline and target clusters |
| 6 | Degree of query-level performance change (percentage) | The degree of change of each query from baseline to target, relative to the baseline performance |
| 7 | Comparison by query type (in seconds) | Compares the elapsed time of different query types, such as SELECT, INSERT, and COPY commands, between the baseline and target clusters |
| 8 | Top 10 slowest running queries (in seconds) | The top 10 slowest queries between the baseline and target clusters, by comparing the elapsed time of both clusters |
| 9 | Top 10 improved queries (in seconds) | The top 10 queries with the most improved elapsed time when comparing the baseline cluster to the target cluster |
Sample results analysis

In our example, the overall workload improvement of the workload isolation architecture using data sharing for the ETL workload, comparing the baseline and the producer, is 858 seconds (baseline_elapsed_time - target_elapsed_time) for the sample TPC data, as shown in the following screenshots.

The overall workload improvement of the workload isolation architecture using data sharing for the BI workload, comparing the baseline and the consumer, is 1,148 seconds (baseline_elapsed_time - target_elapsed_time) for the sample TPC data, as shown in the following screenshots.
Clean up

Complete the following steps to clean up your resources:

- Delete the Redshift provisioned replica cluster and the two Redshift Serverless endpoints (32 RPU and 64 RPU)
- Delete the S3 bucket used to store the artifacts
- Delete the baseline, target-producer, and target-consumer EC2 instances
- Delete the IAM role created for the EC2 instances to access Redshift clusters and S3 buckets
- Delete the IAM roles created for the Amazon Redshift warehouses to access S3 buckets for COPY and UNLOAD commands
Conclusion

In this post, we walked you through the process of testing a workload isolation architecture using Amazon Redshift data sharing and the Test Drive utility. We demonstrated how you can use SQL for advanced price-performance analysis and compare different workloads on different target Redshift configurations. We encourage you to evaluate your Amazon Redshift data sharing architecture using the Redshift Test Drive tool, and to use the provided SQL scripts to analyze the price-performance of your Amazon Redshift cluster.
About the Authors

Ayan Majumder is an Analytics Specialist Solutions Architect at AWS. His expertise lies in designing robust, scalable, and efficient cloud solutions for customers. Beyond his professional life, he enjoys traveling, photography, and outdoor activities.

Ekta Ahuja is an Amazon Redshift Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys landscape photography, traveling, and board games.

Julia Beck is an Analytics Specialist Solutions Architect at AWS. She is passionate about supporting customers in validating and optimizing analytics solutions by architecting proof of concept workloads designed to meet their specific needs.