Migrate workloads from AWS Data Pipeline

AWS Data Pipeline helps customers automate the movement and transformation of data. With Data Pipeline, customers can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. Launched in 2012, Data Pipeline predates several popular Amazon Web Services (AWS) offerings for orchestrating data pipelines, such as AWS Glue, AWS Step Functions, and Amazon Managed Workflows for Apache Airflow (Amazon MWAA).

Data Pipeline has been a foundational service for getting customers off the ground with their extract, transform, and load (ETL) and infrastructure provisioning use cases. Some customers want a deeper level of control and specificity than is possible with Data Pipeline. With recent advancements in the data industry, customers are looking for a more feature-rich platform to modernize their data pipelines and get them ready for data and machine learning (ML) innovation. This post explains how to migrate from Data Pipeline to alternative AWS services to serve the growing needs of data practitioners. The option you choose depends on your current workload on Data Pipeline. You can migrate typical use cases of Data Pipeline to AWS Glue, Step Functions, or Amazon MWAA.

Note that you will need to modify the configurations and code in the examples provided in this post based on your requirements. Before starting any production workloads after migration, you need to test your new workflows to ensure no disruption to production systems.

Migrating workloads to AWS Glue

AWS Glue is a serverless data integration service that helps analytics users discover, prepare, move, and integrate data from multiple sources. It includes tooling for authoring, running jobs, and orchestrating workflows. With AWS Glue, you can discover and connect to hundreds of diverse data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor ETL pipelines to load data into your data lakes. You can also immediately search and query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

We recommend migrating your Data Pipeline workload to AWS Glue when:

  • You're looking for a serverless data integration service that supports various data sources, authoring interfaces including visual editors and notebooks, and advanced data management capabilities such as data quality and sensitive data detection.
  • Your workload can be migrated to AWS Glue workflows, jobs (in Python or Apache Spark), and crawlers (for example, your existing pipeline is built on top of Apache Spark).
  • You need a single platform that can handle all aspects of your data pipeline, including ingestion, processing, transfer, integrity testing, and quality checks.
  • Your existing pipeline was created from a predefined template on the AWS Management Console for Data Pipeline, such as exporting a DynamoDB table to Amazon S3 or importing DynamoDB backup data from S3, and you're looking for the same template.
  • Your workload doesn't depend on a specific Hadoop ecosystem application such as Apache Hive.
  • Your workload doesn't require orchestrating on-premises servers, user-managed Amazon Elastic Compute Cloud (Amazon EC2) instances, or a user-managed Amazon EMR cluster.

Example: Migrate EmrActivity on EmrCluster to export DynamoDB tables to S3

One of the most common workloads on Data Pipeline is backing up Amazon DynamoDB tables to Amazon Simple Storage Service (Amazon S3). Data Pipeline has a predefined template named Export DynamoDB table to S3 to export DynamoDB table data to a given S3 bucket.

The template uses EmrActivity (named TableBackupActivity), which runs on EmrCluster (named EmrClusterForBackup) and backs up data from DynamoDBDataNode to S3DataNode.

You can migrate these pipelines to AWS Glue because it natively supports reading from DynamoDB.

To define an AWS Glue job for the preceding use case:

  1. Open the AWS Glue console.
  2. Choose ETL jobs.
  3. Choose Visual ETL.
  4. For Sources, select Amazon DynamoDB.
  5. On the node Data source - DynamoDB, for DynamoDB source, select Choose the DynamoDB table directly, then select your source DynamoDB table from the menu.
  6. For Connection options, enter s3.bucket and dynamodb.s3.prefix.
  7. Choose + (plus) to add a new node.
  8. For Targets, select Amazon S3.
  9. On the node Data target - S3 bucket, for Format, select your preferred format, for example, Parquet.
  10. For S3 Target Location, enter your destination S3 path.
  11. On the Job details tab, select an IAM role. If you do not have an IAM role, follow Configuring IAM permissions for AWS Glue.
  12. Choose Save and Run.

Your AWS Glue job has been successfully created and started.

You might notice that there is no property to manage the read I/O rate. That's because the default DynamoDB reader used in Glue Studio doesn't scan the source DynamoDB table. Instead, it uses DynamoDB export.
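
If you prefer to manage the same job as a script instead of through the visual editor, the following is a minimal sketch of an equivalent PySpark job. It is not taken from the console-generated script; the table ARN, staging bucket, prefix, and destination path are placeholders, and it uses the same DynamoDB export connector, so adjust the options to match your pipeline.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Standard AWS Glue job setup
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the source table through the DynamoDB export connector.
    # The table ARN, staging bucket, and prefix below are placeholders.
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="dynamodb",
        connection_options={
            "dynamodb.export": "ddb",
            "dynamodb.tableArn": "arn:aws:dynamodb:us-east-1:123456789012:table/my-source-table",
            "dynamodb.s3.bucket": "my-export-staging-bucket",
            "dynamodb.s3.prefix": "temporary/ddbexport/",
            "dynamodb.unnestDDBJson": True,
        },
    )

    # Write the backup to the destination S3 path in Parquet format.
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://my-backup-bucket/dynamodb/my-source-table/"},
        format="parquet",
    )

    job.commit()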

Example: Migrate EmrActivity on EmrCluster to import DynamoDB from S3

Another common workload on Data Pipeline is restoring DynamoDB tables from backup data on Amazon S3. Data Pipeline has a predefined template named Import DynamoDB backup data from S3 to import DynamoDB table data from a given S3 bucket.

The template uses EmrActivity (named TableLoadActivity), which runs on EmrCluster (named EmrClusterForLoad) and loads data from S3DataNode to DynamoDBDataNode.

You can migrate these pipelines to AWS Glue because it natively supports writing to DynamoDB.

The prerequisites are to create a destination DynamoDB table and catalog it in the AWS Glue Data Catalog using a Glue crawler, the Glue console, or the API.

  1. Open the AWS Glue console.
  2. Choose ETL jobs.
  3. Choose Visual ETL.
  4. For Sources, select Amazon S3.
  5. On the node Data source - S3 bucket, for S3 URL, enter your S3 path.
  6. Choose + (plus) to add a new node.
  7. For Targets, select AWS Glue Data Catalog.
  8. On the node Data target - Data Catalog, for Database, select your destination database in the Data Catalog.
  9. For Table, select your destination table in the Data Catalog.
  10. On the Job details tab, select an IAM role. If you do not have an IAM role, follow Configuring IAM permissions for AWS Glue.
  11. Choose Save and Run.

Your AWS Glue job has been successfully created and started.
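
As a script, a minimal sketch of the import job could look like the following. It reads the backup from S3 and writes directly to the destination DynamoDB table rather than through the Data Catalog target used in the console steps above; the paths, table name, and write percentage are placeholders.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Standard AWS Glue job setup
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the backup data from S3 (placeholder path and format).
    backup = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-backup-bucket/dynamodb/my-source-table/"]},
        format="parquet",
    )

    # Write into the pre-created destination DynamoDB table. The write
    # percentage throttles how much of the table's write capacity is used.
    glue_context.write_dynamic_frame_from_options(
        frame=backup,
        connection_type="dynamodb",
        connection_options={
            "dynamodb.output.tableName": "my-restored-table",
            "dynamodb.throughput.write.percent": "0.5",
        },
    )

    job.commit()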

Migrating workloads to Step Functions

AWS Step Functions is a serverless orchestration service that lets you build workflows for your business-critical applications. With Step Functions, you use a visual editor to build workflows and integrate directly with over 11,000 actions for over 250 AWS services, including AWS Lambda, Amazon EMR, DynamoDB, and more. You can use Step Functions for orchestrating data processing pipelines, handling errors, and working with the throttling limits on the underlying AWS services. You can create workflows that process and publish machine learning models, orchestrate microservices, and control AWS services, such as AWS Glue, to create ETL workflows. You can also create long-running, automated workflows for applications that require human interaction.

We recommend migrating your Data Pipeline workload to Step Functions when:

  • You're looking for a serverless, highly available workflow orchestration service.
  • You're looking for a cost-effective solution that charges at single-task granularity.
  • Your workloads are orchestrating tasks for multiple AWS services, such as Amazon EMR, AWS Lambda, AWS Glue, or DynamoDB.
  • You're looking for a low-code solution that comes with a drag-and-drop visual designer for workflow creation and doesn't require learning new programming concepts.
  • You're looking for a service that provides integrations with over 250 AWS services covering over 11,000 actions out of the box, as well as allowing integrations with custom non-AWS services and actions.
  • Both Data Pipeline and Step Functions use JSON format to define workflows. This lets you store your workflows in source control, manage versions, control access, and automate with continuous integration and delivery (CI/CD). Step Functions uses a syntax called Amazon States Language, which is fully based on JSON and allows a seamless transition between the textual and visual representations of the workflow.
  • Your workload requires orchestrating on-premises servers, user-managed EC2 instances, or a user-managed EMR cluster.

With Step Functions, you can choose the same version of Amazon EMR that you're currently using in Data Pipeline.

For migrating activities on Data Pipeline-managed resources, you can use AWS SDK service integrations in Step Functions to automate resource provisioning and cleanup. For migrating activities on on-premises servers, user-managed EC2 instances, or a user-managed EMR cluster, you can install the SSM Agent on the instance. You can initiate the command through the AWS Systems Manager Run Command from Step Functions. You can also initiate the state machine from a schedule defined in Amazon EventBridge.
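
For example, a Data Pipeline schedule can be reproduced with an EventBridge rule that starts the state machine on a rate or cron expression. The following is a minimal boto3 sketch; the rule name, state machine ARN, and role ARN are placeholders, and the role must allow EventBridge to call states:StartExecution.

    import boto3

    events = boto3.client("events")

    # Create (or update) a scheduled rule, mirroring a Data Pipeline
    # period of one day. Names and ARNs below are placeholders.
    events.put_rule(
        Name="daily-data-pipeline-replacement",
        ScheduleExpression="rate(1 day)",
        State="ENABLED",
    )

    # Point the rule at the Step Functions state machine.
    events.put_targets(
        Rule="daily-data-pipeline-replacement",
        Targets=[
            {
                "Id": "start-state-machine",
                "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:my-migrated-workflow",
                "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-start-execution-role",
            }
        ],
    )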

Example: Migrate HadoopActivity on EmrCluster

To migrate HadoopActivity on EmrCluster on Data Pipeline to Step Functions:

  1. Open the AWS Step Functions console.
  2. Choose State machines.
  3. Choose Create state machine.
  4. In the Choose a template wizard, search for emr, select Manage an EMR job, and choose Select.
  5. For Choose how to use this template, select Build on it.
  6. Choose Use template.
  7. For the Create an EMR cluster state, configure API Parameters based on the EMR release label, EMR capacity, IAM role, and so on, from the existing EmrCluster node configuration on Data Pipeline.
  8. For the Run first step state, configure API Parameters based on the JAR file and arguments from the existing HadoopActivity node configuration on Data Pipeline.
  9. If you have further activities configured on the existing HadoopActivity, repeat step 8.
  10. Choose Create.

Your state machine has been successfully configured. Learn more in Manage an Amazon EMR Job.
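
Because the definition is Amazon States Language (JSON), you can also keep it in source control and create the state machine programmatically. The following boto3 sketch shows the general shape of the create cluster, run step, terminate cluster pattern; the release label, instance settings, JAR, arguments, and ARNs are placeholders that you would replace with values from your EmrCluster and HadoopActivity nodes.

    import json

    import boto3

    # Amazon States Language definition: create a cluster, run one step,
    # then terminate the cluster. All specific values are placeholders.
    definition = {
        "StartAt": "Create an EMR cluster",
        "States": {
            "Create an EMR cluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
                "Parameters": {
                    "Name": "MigratedHadoopActivityCluster",
                    "ReleaseLabel": "emr-6.1.0",
                    "Applications": [{"Name": "Hadoop"}],
                    "ServiceRole": "EMR_DefaultRole",
                    "JobFlowRole": "EMR_EC2_DefaultRole",
                    "Instances": {
                        "KeepJobFlowAliveWhenNoSteps": True,
                        "InstanceFleets": [
                            {
                                "InstanceFleetType": "MASTER",
                                "TargetOnDemandCapacity": 1,
                                "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
                            },
                            {
                                "InstanceFleetType": "CORE",
                                "TargetOnDemandCapacity": 2,
                                "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
                            },
                        ],
                    },
                },
                "ResultPath": "$.Cluster",
                "Next": "Run first step",
            },
            "Run first step": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId.$": "$.Cluster.ClusterId",
                    "Step": {
                        "Name": "HadoopActivity step",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "s3://my-bucket/my-hadoop-job.jar",
                            "Args": ["arg1", "arg2"],
                        },
                    },
                },
                "ResultPath": "$.FirstStep",
                "Next": "Terminate the cluster",
            },
            "Terminate the cluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
                "Parameters": {"ClusterId.$": "$.Cluster.ClusterId"},
                "End": True,
            },
        },
    }

    sfn = boto3.client("stepfunctions")
    sfn.create_state_machine(
        name="migrated-hadoop-activity",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/StepFunctionsEmrExecutionRole",
    )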

Migrating workloads to Amazon MWAA

Amazon MWAA is a managed orchestration service for Apache Airflow that lets you use the Apache Airflow platform to set up and operate end-to-end data pipelines in the cloud at scale. Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks referred to as workflows. Apache Airflow brings in new concepts like executors, pools, and SLAs that provide you with advanced data orchestration capabilities. With Amazon MWAA, you can use Airflow and the Python programming language to create workflows without having to manage the underlying infrastructure for scalability, availability, and security. Amazon MWAA automatically scales its workflow runtime capacity to meet your needs and is integrated with AWS security services to help provide you with fast and secure access to your data.

We recommend migrating your Data Pipeline workloads to Amazon MWAA when:

  • You're looking for a managed, highly available service to orchestrate workflows written in Python.
  • You want to transition to a fully managed, widely adopted open source technology, Apache Airflow, for maximum portability.
  • You require a single platform that can handle all aspects of your data pipeline, including ingestion, processing, transfer, integrity testing, and quality checks.
  • You're looking for a service designed for data pipeline orchestration with features such as a rich UI for observability, restarts for failed workflows, backfills, retries for tasks, and lineage support with OpenLineage.
  • You're looking for a service that comes with more than 1,000 pre-built operators and sensors, covering AWS as well as non-AWS services.
  • Your workload requires orchestrating on-premises servers, user-managed EC2 instances, or a user-managed EMR cluster.

Amazon MWAA workflows are defined as directed acyclic graphs (DAGs) using Python, so you can also treat them as source code. Airflow's extensible Python framework lets you build workflows connecting with virtually any technology. It comes with a rich user interface for viewing and monitoring workflows and can be easily integrated with version control systems to automate the CI/CD process. With Amazon MWAA, you can choose the same version of Amazon EMR that you're currently using in Data Pipeline.

Example: Migrate HadoopActivity on EmrCluster

Complete the following steps if you do not have an existing Amazon MWAA environment:

  1. Create an AWS CloudFormation template on your computer by copying the template from the quick start guide into a local text file.
  2. On the CloudFormation console, choose Stacks in the navigation pane.
  3. Choose Create stack with the option With new resources (standard).
  4. Choose Upload a template file and select the local template file.
  5. Choose Next.
  6. Complete the setup steps, entering a name for the environment, and leave the rest of the parameters as default.
  7. On the last step, acknowledge that resources will be created and choose Submit.

The creation can take 20–30 minutes, until the status of the stack changes to CREATE_COMPLETE. The resource that takes the most time is the Airflow environment. While it's being created, you can continue with the following steps, until you're required to open the Airflow UI.
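
If you prefer to create the stack programmatically, a minimal boto3 sketch follows. The stack name and local template file name are placeholders, and CAPABILITY_IAM is included on the assumption that the quick start template creates IAM resources.

    import boto3

    cfn = boto3.client("cloudformation")

    # Create the quick start stack from the template saved locally in step 1
    # (the file name and stack name are placeholders).
    with open("mwaa-quick-start.yaml") as f:
        template_body = f.read()

    cfn.create_stack(
        StackName="myairflowstack",
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_IAM"],
    )

    # Optionally block until the environment is ready (20-30 minutes).
    cfn.get_waiter("stack_create_complete").wait(StackName="myairflowstack")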

An Airflow workflow is based on a DAG, which is defined by a Python file that programmatically specifies the different tasks involved and their interdependencies. Complete the following steps to create the DAG:

  1. Create a local file named emr_dag.py using a text editor with the following snippet, and configure the EMR-related parameters based on the existing Data Pipeline definition:
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import (
        EmrCreateJobFlowOperator,
        EmrAddStepsOperator,
    )
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor
    from airflow.utils.dates import days_ago
    from datetime import timedelta
    import os

    # Use the file name (without .py) as the DAG ID
    DAG_ID = os.path.basename(__file__).replace(".py", "")

    # EMR step equivalent to the HadoopActivity on Data Pipeline
    SPARK_STEPS = [
        {
            'Name': 'calculate_pi',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-example', 'SparkPi', '10'],
            },
        }
    ]

    # Cluster configuration equivalent to the EmrCluster node on Data Pipeline
    JOB_FLOW_OVERRIDES = {
        'Name': 'my-demo-cluster',
        'ReleaseLabel': 'emr-6.1.0',
        'Applications': [
            {
                'Name': 'Spark'
            },
        ],
        'Instances': {
            'InstanceGroups': [
                {
                    'Name': "Master nodes",
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'MASTER',
                    'InstanceType': 'm5.xlarge',
                    'InstanceCount': 1,
                },
                {
                    'Name': "Slave nodes",
                    'Market': 'ON_DEMAND',
                    'InstanceRole': 'CORE',
                    'InstanceType': 'm5.xlarge',
                    'InstanceCount': 2,
                }
            ],
            'KeepJobFlowAliveWhenNoSteps': False,
            'TerminationProtected': False,
        },
        'VisibleToAllUsers': True,
        'JobFlowRole': 'EMR_EC2_DefaultRole',
        'ServiceRole': 'EMR_DefaultRole'
    }

    with DAG(
        dag_id=DAG_ID,
        start_date=days_ago(1),
        schedule_interval="@once",
        dagrun_timeout=timedelta(hours=2),
        catchup=False,
        tags=['emr'],
    ) as dag:
        # Create the EMR cluster
        cluster_creator = EmrCreateJobFlowOperator(
            task_id='create_job_flow',
            job_flow_overrides=JOB_FLOW_OVERRIDES,
            aws_conn_id='aws_default',
        )
        # Submit the step to the cluster
        step_adder = EmrAddStepsOperator(
            task_id='add_steps',
            job_flow_id=cluster_creator.output,
            aws_conn_id='aws_default',
            steps=SPARK_STEPS,
        )
        # Wait for the step to complete
        step_checker = EmrStepSensor(
            task_id='watch_step',
            job_flow_id=cluster_creator.output,
            step_id="{{ task_instance.xcom_pull(task_ids='add_steps')[0] }}",
            aws_conn_id='aws_default',
        )
        cluster_creator >> step_adder >> step_checker

Defining the schedule in Amazon MWAA is as simple as updating the schedule_interval parameter for the DAG. For example, to run the DAG daily, set schedule_interval="@daily".
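
As an illustration, a hypothetical Data Pipeline schedule that runs once a day at 01:00 UTC could be expressed with a cron string, as in this trimmed-down sketch of the DAG declaration:

    from datetime import timedelta

    from airflow import DAG
    from airflow.utils.dates import days_ago

    # Runs every day at 01:00 UTC; any cron expression (or presets such as
    # "@daily" and "@hourly") can be used to match the period and start
    # time of the Data Pipeline schedule being migrated.
    with DAG(
        dag_id="emr_dag",
        start_date=days_ago(1),
        schedule_interval="0 1 * * *",
        dagrun_timeout=timedelta(hours=2),
        catchup=False,
    ) as dag:
        ...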

Now, you create a workflow that invokes the Amazon EMR step you just created:

  1. On the Amazon S3 console, locate the bucket created by the CloudFormation template, which will have a name starting with the name of the stack followed by -environmentbucket- (for example, myairflowstack-environmentbucket-ap1qks3nvvr4).
  2. Inside that bucket, create a folder called dags, and inside that folder, upload the DAG file emr_dag.py that you created in the previous section.
  3. On the Amazon MWAA console, navigate to the environment you deployed with the CloudFormation stack.

If the status is not yet Available, wait until it reaches that state. It shouldn't take longer than 30 minutes after you deployed the CloudFormation stack.

  4. Choose the environment link in the table to see the environment details.

It's configured to pick up DAGs from the bucket and folder you used in the previous steps. Airflow will monitor that folder for changes.

  5. Choose Open Airflow UI to open a new tab accessing the Airflow UI, using the built-in IAM security to sign you in.

If there are issues with the DAG file you created, it will display an error at the top of the page indicating the lines affected. In that case, review the steps and upload it again. After a few seconds, Airflow will parse it and update or remove the error banner.
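
If you prefer to deploy the DAG programmatically rather than through the S3 console (steps 1 and 2 above), a minimal boto3 sketch is shown below; the bucket name is a placeholder for the environment bucket created by the CloudFormation stack.

    import boto3

    s3 = boto3.client("s3")

    # Upload the DAG file into the dags/ prefix of the environment bucket
    # (bucket name is a placeholder); Amazon MWAA picks up changes from there.
    s3.upload_file(
        Filename="emr_dag.py",
        Bucket="myairflowstack-environmentbucket-ap1qks3nvvr4",
        Key="dags/emr_dag.py",
    )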

Clean up

After you migrate your existing Data Pipeline workloads and verify that the migration was successful, delete your pipelines in Data Pipeline to stop further runs and billing.

Conclusion

In this blog post, we outlined a few alternative AWS services for migrating your existing Data Pipeline workloads. You can migrate to AWS Glue to run and orchestrate Apache Spark applications, AWS Step Functions to orchestrate workflows involving various other AWS services, or Amazon MWAA to help manage workflow orchestration using Apache Airflow. By migrating, you will be able to run your workloads with a broader range of data integration functionalities. If you have additional questions, post in the comments or check out the migration examples in our documentation.


About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team and AWS Data Pipeline team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Vaibhav Porwal is a Senior Software Development Engineer on the AWS Glue and AWS Data Pipeline team. He is working on solving problems in the orchestration space by building low-cost, repeatable, scalable workflow systems that enable customers to create their ETL pipelines seamlessly.

Sriram Ramarathnam is a Software Development Manager on the AWS Glue and AWS Data Pipeline team. His team works on solving challenging distributed systems problems for data integration across AWS serverless and serverful compute offerings.

Matt Su is a Senior Product Manager on the AWS Glue team and AWS Data Pipeline team. He enjoys helping customers discover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys snowboarding and gardening.
