Use the AWS CDK with the Data Solutions Framework to provision and manage Amazon Redshift Serverless

In February 2024, we announced the release of the Data Solutions Framework (DSF), an opinionated open source framework for building data solutions on AWS. DSF is built using the AWS Cloud Development Kit (AWS CDK) to package infrastructure components into L3 AWS CDK constructs on top of AWS services. L3 constructs are implementations of common technical patterns and create multiple resources that are configured to work with each other.

In this post, we demonstrate how to use the AWS CDK and DSF to create a multi-data warehouse platform based on Amazon Redshift Serverless. DSF simplifies the provisioning of Redshift Serverless, initialization and cataloging of data, and data sharing between different data warehouse deployments. Using a programmatic approach with the AWS CDK and DSF allows you to apply GitOps principles to your analytics workloads and realize the following benefits:

  • You can deploy using continuous integration and delivery (CI/CD) pipelines, including the definitions of Redshift objects (databases, tables, shares, and so on)
  • You can roll out changes consistently across multiple environments
  • You can bootstrap data warehouses (table creation, ingestion of data, and so on) using code and use version control to simplify the setup of testing environments
  • You can test changes before deployment using the AWS CDK built-in testing capabilities, as shown in the sketch after this list
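The following is a minimal sketch of such a unit test based on the AWS CDK assertions module. It assumes the stack class generated by cdk init is named DsfRedshiftBlogStack and lives in the dsf_redshift_blog package created later in this post; adjust the names to your project.

import aws_cdk as cdk
from aws_cdk.assertions import Template

# Hypothetical import path; it matches the project layout created by `cdk init` below.
from dsf_redshift_blog.dsf_redshift_blog_stack import DsfRedshiftBlogStack


def test_redshift_workgroup_is_created():
    # Synthesize the stack into a CloudFormation template.
    app = cdk.App()
    stack = DsfRedshiftBlogStack(app, "TestStack")
    template = Template.from_stack(stack)

    # Assert that exactly one Redshift Serverless workgroup is defined.
    template.resource_count_is("AWS::RedshiftServerless::Workgroup", 1)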

In addition, DSF’s Redshift Serverless L3 constructs provide various built-in capabilities that can accelerate development while helping you follow best practices. For example:

  • Running extract, transform, and load (ETL) jobs to and from Amazon Redshift is more straightforward because an AWS Glue connection resource is automatically created and configured. This means data engineers don’t need to configure this resource and can use it right away with their AWS Glue ETL jobs.
  • Similarly, for discovery of data within Amazon Redshift, DSF provides a convenient method to configure an AWS Glue crawler to populate the AWS Glue Data Catalog for ease of discovery as well as ease of referencing tables when creating ETL jobs. The configured AWS Glue crawler uses an AWS Identity and Access Management (IAM) role that follows least privilege.
  • Sharing data between Redshift data warehouses is a common approach to improve collaboration between lines of business without duplicating data. DSF provides convenient methods for the end-to-end flow for both data producer and consumer.

Solution overview

The solution demonstrates a common pattern where a data warehouse is used as a serving layer for business intelligence (BI) workloads on top of data lake data. The source data is stored in Amazon Simple Storage Service (Amazon S3) buckets, then ingested into a Redshift producer data warehouse to create materialized views and aggregate data, and finally shared with a Redshift consumer running BI queries from the end-users. The following diagram illustrates the high-level architecture.


In this post, we use Python for the example code. DSF also supports TypeScript.

Prerequisites

Because we’re using the AWS CDK, complete the steps in Getting Started with the AWS CDK before you implement the solution.
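If you are starting from a fresh environment, the setup typically includes installing the AWS CDK CLI and bootstrapping your AWS account and Region, along the lines of the following commands (the account ID and Region are placeholders; refer to the official guide for details specific to your environment):

npm install -g aws-cdk
cdk bootstrap aws://<ACCOUNT_ID>/<AWS_REGION>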

Initialize the project and provision a Redshift Serverless namespace and workgroup

Let’s start by initializing the project and including DSF as a dependency. You can run this code in your local terminal, or you can use AWS Cloud9:

mkdir dsf-redshift-blog && cd dsf-redshift-blog
cdk init --language python
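DSF is distributed as a Python package built with jsii. Assuming it is published on PyPI as cdklabs.aws_data_solutions_framework (check the DSF documentation for the exact package name and version to pin), you can add it to the project’s requirements.txt and install it into the virtual environment created by cdk init:

echo "cdklabs.aws_data_solutions_framework" >> requirements.txt
source .venv/bin/activate
python -m pip install -r requirements.txt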

Open the project folder in your IDE and complete the following steps:

  1. Open the app.py file.
  2. In this file, make sure to uncomment the first env line. This configures the AWS CDK environment depending on the AWS profile used during the deployment.
  3. Add a configuration flag in the cdk.context.json file at the root of the project (if it doesn’t exist, create the file):
    {  
        "@data-solutions-framework-on-aws/removeDataOnDestroy": true 
    }

Setting the @data-solutions-framework-on-aws/removeDataOnDestroy configuration flag to true makes sure resources that have the removal_policy parameter set to RemovalPolicy.DESTROY are destroyed when the AWS CDK stack is deleted. This is a guardrail DSF uses to prevent accidentally deleting data.

Now that the project is configured, you can start adding resources to the stack.

  1. Navigate to the dsf_redshift_blog folder and open the dsf_redshift_blog_stack.py file.

This is where we configure the resources to be deployed.

  1. To get started building the end-to-end demo, add the following import statements at the top of the file, which allow you to define resources from both the AWS CDK core library and DSF:
    from aws_cdk import (
        RemovalPolicy,
        Stack
    )
    
    from aws_cdk.aws_s3 import Bucket
    from aws_cdk.aws_iam import Role, ServicePrincipal
    from constructs import Construct
    from cdklabs import aws_data_solutions_framework as dsf

We use several DSF-specific constructs to build the demo:

  • DataLakeStorage – This creates three S3 buckets, named Bronze, Silver, and Gold, to represent the different data layers.
  • S3DataCopy – This manages the copying of data from one bucket to another.
  • RedshiftServerlessNamespace – This creates a Redshift Serverless namespace where database objects and users are stored.
  • RedshiftServerlessWorkgroup – This creates a Redshift Serverless workgroup that contains compute- and network-related configurations for the data warehouse. This is also the entry point for several convenient functionalities that DSF provides, such as cataloging of Redshift tables, running SQL statements as part of the AWS CDK (such as creating tables, data ingestion, merging of tables, and more), and sharing datasets across different Redshift clusters without moving data.
  1. Now that you’ve imported the libraries, create a set of S3 buckets following medallion architecture best practices with bronze, silver, and gold data layers.

The high-level definitions of each layer are as follows:

  • Bronze represents raw data; this is where data from various source systems lands. No schema is required.
  • Silver is cleaned and potentially augmented data. The schema is enforced in this layer.
  • Gold is data that is further refined and aggregated to serve a specific business need.

Using the DataLakeStorage construct, you can create these three S3 buckets with the following best practices:

  • Encryption at rest through AWS Key Management Service (AWS KMS) is turned on
  • SSL is enforced
  • The use of S3 bucket keys is turned on
  • There is a default S3 lifecycle rule defined to delete incomplete multipart uploads after 1 day
    data_lake = dsf.storage.DataLakeStorage(self,
        'DataLake',
        removal_policy=RemovalPolicy.DESTROY)

  1. After you create the S3 buckets, copy over the data using the S3DataCopy construct. For this demo, we land the data in the Silver bucket because it’s already cleaned:
    source_bucket = Bucket.from_bucket_name(self, 
        'SourceBucket', 
        bucket_name="redshift-immersionday-labs")
    
    data_copy = dsf.utils.S3DataCopy(self,
        'SourceData', 
        source_bucket=source_bucket, 
        source_bucket_prefix='data/amazon-reviews/', 
        source_bucket_region='us-west-2', 
        target_bucket=data_lake.silver_bucket, 
        target_bucket_prefix='silver/amazon-reviews/')

  2. For Amazon Redshift to ingest the data in Amazon S3, it needs an IAM role with the right permissions. This role will be associated with the Redshift Serverless namespace that you create next.
    lake_role = Role(self, 
        'LakeRole', 
        assumed_by=ServicePrincipal('redshift.amazonaws.com'))
    
    data_lake.silver_bucket.grant_read(lake_role)

  3. To provision Redshift Serverless, configure two resources: a namespace and a workgroup. DSF provides L3 constructs for both:
    1. RedshiftServerlessNamespace
    2. RedshiftServerlessWorkgroup

    Both constructs follow security best practices, including:

    • The default virtual private cloud (VPC) uses private subnets (with public access disabled).
    • Data is encrypted at rest through AWS KMS with automatic key rotation.
    • Admin credentials are stored in AWS Secrets Manager with automatic rotation managed by Amazon Redshift.
    • A default AWS Glue connection is automatically created using private connectivity. This can be used by AWS Glue crawlers as well as AWS Glue ETL jobs to connect to Amazon Redshift.

    The RedshiftServerlessWorkgroup construct is the main entry point for other capabilities, such as integration with the AWS Glue Data Catalog, Redshift Data API, and Data Sharing API.

    1. In the following example, use the defaults provided by the construct and associate the IAM role that you created earlier to give Amazon Redshift access to the data lake for data ingestion:
      namespace = dsf.consumption.RedshiftServerlessNamespace(self, 
          'Namespace', 
          db_name="defaultdb", 
          identify="producer", 
          removal_policy=RemovalPolicy.DESTROY, 
          default_iam_role=lake_role)
      
      workgroup = dsf.consumption.RedshiftServerlessWorkgroup(self, 
          'Workgroup', 
          identify="producer", 
          namespace=namespace, 
          removal_policy=RemovalPolicy.DESTROY)

Create tables and ingest data

To create a table, you can use the runCustomSQL method in the RedshiftServerlessWorkgroup construct. This method allows you to run arbitrary SQL statements when the resource is being created (such as create table or create materialized view) and when it’s being deleted (such as drop table or drop materialized view).

Add the following code after the RedshiftServerlessWorkgroup instantiation:

create_amazon_reviews_table = workgroup.run_custom_sql('CreateAmazonReviewsTable', 
    database_name="defaultdb", 
    sql="CREATE TABLE amazon_reviews (marketplace character varying(16383) ENCODE lzo, customer_id character varying(16383) ENCODE lzo, review_id character varying(16383) ENCODE lzo, product_id character varying(16383) ENCODE lzo, product_parent character varying(16383) ENCODE lzo, product_title character varying(16383) ENCODE lzo, star_rating integer ENCODE az64, helpful_votes integer ENCODE az64, total_votes integer ENCODE az64, vine character varying(16383) ENCODE lzo, verified_purchase character varying(16383) ENCODE lzo, review_headline character varying(max) ENCODE lzo, review_body character varying(max) ENCODE lzo, review_date date ENCODE az64, year integer ENCODE az64) DISTSTYLE AUTO;", 
    delete_sql="drop table amazon_reviews")

load_amazon_reviews_data = workgroup.ingest_data('amazon_reviews_ingest_data', 
    'defaultdb', 
    'amazon_reviews', 
    data_lake.silver_bucket, 
    'silver/amazon-reviews/', 
    'FORMAT parquet')

load_amazon_reviews_data.node.add_dependency(create_amazon_reviews_table)
load_amazon_reviews_data.node.add_dependency(data_copy)

Given the asynchronous nature of some of the resource creation, we also enforce dependencies between certain resources; otherwise, the AWS CDK would try to create them in parallel to accelerate the deployment. The preceding dependency statements establish the following:

  • Before you load the data, the S3 data copy is complete, so the data exists in the source bucket of the ingestion
  • Before you load the data, the target table has been created in the Redshift namespace

Bootstrapping example (materialized views)

The workgroup.run_custom_sql() method provides flexibility in how you bootstrap your Redshift data warehouse using the AWS CDK. For example, you can create a materialized view to improve the queries’ performance by pre-aggregating data from the Amazon reviews:

materialized_view = workgroup.run_custom_sql('MvProductAnalysis',
    database_name="defaultdb",
    sql=f'''CREATE MATERIALIZED VIEW mv_product_analysis AS SELECT review_date, product_title, COUNT(1) AS review_total, SUM(star_rating) AS rating FROM amazon_reviews WHERE marketplace='US' GROUP BY 1,2;''',
    delete_sql="drop materialized view mv_product_analysis")

materialized_view.node.add_dependency(load_amazon_reviews_data)

Catalog tables in Amazon Redshift

The deployment of RedshiftServerlessWorkgroup automatically creates an AWS Glue connection resource that can be used by AWS Glue crawlers and AWS Glue ETL jobs. This is directly exposed from the workgroup construct through the glue_connection property. Using this connection, the workgroup construct exposes a convenient method to catalog the tables inside the associated Redshift Serverless namespace. The following is an example:

workgroup.catalog_tables('DefaultDBCatalog', 'mv_product_analysis')

This single line of code creates a database in the Data Catalog named mv_product_analysis and the associated crawler with the IAM role and network configuration already configured. By default, it crawls all the tables inside the public schema in the default database indicated when the Redshift Serverless namespace was created. To override this, the third parameter in the catalogTables method allows you to define a pattern on what to crawl (see the JDBC data store in the include path).
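For example, a sketch of overriding the crawl scope could look like the following, assuming the pattern follows the AWS Glue JDBC include path convention of database/schema/table (verify the expected format against the DSF documentation):

workgroup.catalog_tables('DefaultDBCatalog', 'mv_product_analysis', 'defaultdb/public/%')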

You can run the crawler using the AWS Glue console, or invoke it using the SDK, the AWS Command Line Interface (AWS CLI), or the AWS CDK using AwsCustomResource.
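For example, a one-off run from the AWS CLI could look like the following, where the crawler name is a placeholder for the crawler provisioned by DSF (you can find its name on the AWS Glue console):

aws glue start-crawler --name <CRAWLER_NAME>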

Data sharing

DSF supports Redshift data sharing on both sides (producers and consumers) as well as same account and cross-account scenarios. Let’s create another Redshift Serverless namespace and workgroup to demonstrate the interaction:

namespace2 = dsf.consumption.RedshiftServerlessNamespace(self, 
    "Namespace2", 
    db_name="defaultdb", 
    identify="shopper", 
    default_iam_role=lake_role, 
    removal_policy=RemovalPolicy.DESTROY)

workgroup2 = dsf.consumption.RedshiftServerlessWorkgroup(self, 
    "Workgroup2", 
    identify="shopper", 
    namespace=namespace2, 
    removal_policy=RemovalPolicy.DESTROY)

For producers

For producers, complete the following steps:

  1. Create the new share and populate the share with the schema or tables:
    data_share = workgroup.create_share('DataSharing', 
        'defaultdb', 
        'defaultdbshare', 
        'public', ['mv_product_analysis'])
    
    data_share.new_share_custom_resource.node.add_dependency(materialized_view)
  2. Create access grants:
    • To grant to a cluster in the same account:
      share_grant = workgroup.grant_access_to_share("GrantToSameAccount", 
          data_share, 
          namespace2.namespace_id)
      
      share_grant.resource.node.add_dependency(data_share.new_share_custom_resource)
      share_grant.resource.node.add_dependency(namespace2)
    • To grant to a different account:
      workgroup.grant_access_to_share('GrantToDifferentAccount', 
          tpcdsShare, 
          None, 
          '<ACCOUNT_ID_OF_CONSUMER>', 
          True)

The last parameter in the grant_access_to_share method automatically authorizes the cross-account access on the data share. Omitting this parameter defaults to no authorization, which means a Redshift administrator needs to authorize the cross-account share using the AWS CLI, SDK, or Amazon Redshift console.
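For reference, a sketch of that manual authorization from the producer account using the AWS CLI could look like the following (the data share ARN and consumer account ID are placeholders):

aws redshift authorize-data-share \
    --data-share-arn <DATA_SHARE_ARN> \
    --consumer-identifier <ACCOUNT_ID_OF_CONSUMER>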

For consumers

For the same account share, use the following code to create the database from the share:

create_db_from_share = workgroup2.create_database_from_share("CreateDatabaseFromShare", 
    "marketing", 
    data_share.data_share_name, 
    data_share.producer_namespace)

create_db_from_share.resource.node.add_dependency(share_grant.resource)
create_db_from_share.resource.node.add_dependency(workgroup2)

For cross-account grants, the syntax is similar, but you need to indicate the producer account ID:

consumerWorkgroup.create_database_from_share('CreateCrossAccountDatabaseFromShare', 
    'tpcds', 
    <PRODUCER_SHARE_NAME>, 
    <PRODUCER_NAMESPACE_ID>, 
    <PRODUCER_ACCOUNT_ID>)

To see the full working example, follow the instructions in the accompanying GitHub repository.

Deploy the resources using the AWS CDK

To deploy the resources, run the following code:
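cdk deploy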

You can review the resources created, as shown in the following screenshot.

Confirm the changes for the deployment to start. Wait a few minutes for the project to be deployed; you can keep track of the deployment using the AWS CLI or the AWS CloudFormation console.

When the deployment is complete, you should see two Redshift workgroups (one producer and one consumer).

Using Amazon Redshift Query Editor v2, you can log in to the producer Redshift workgroup using Secrets Manager, as shown in the following screenshot.

Producer QEV2 Login

After you log in, you can see the tables and views that you created using DSF in the defaultdb database.

QEv2 Tables

Log in to the consumer Redshift workgroup to see the shared dataset from the producer Redshift workgroup under the marketing database.

Clean up

You can run cdk destroy in your local terminal to delete the stack. Because you marked the constructs with RemovalPolicy.DESTROY and configured DSF to remove data on destroy, running cdk destroy or deleting the stack from the AWS CloudFormation console will clean up the provisioned resources.

Conclusion

In this post, we demonstrated how to use the AWS CDK along with DSF to manage Redshift Serverless as code. Codifying the deployment of resources helps provide consistency across multiple environments. Besides infrastructure, DSF also provides capabilities to bootstrap Amazon Redshift (table creation, data ingestion, and more) and manage objects, all from the AWS CDK. This means that changes can be version controlled, reviewed, and even unit tested.

In addition to Redshift Serverless, DSF supports other AWS services, such as Amazon Athena, Amazon EMR, and many more. Our roadmap is publicly available, and we look forward to your feature requests, contributions, and feedback.

You can get started with DSF by following our quick start guide.


About the authors


Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.
Vincent Gromakowski is an Analytics Specialist Solutions Architect at AWS where he enjoys solving customers’ analytics, NoSQL, and streaming challenges. He has strong expertise in distributed data processing engines and resource orchestration platforms.
