Attribute Amazon EMR on EC2 costs to your end-users

Amazon EMR on EC2 is a managed service that makes it straightforward to run big data processing and analytics workloads on AWS. It simplifies the setup and management of popular open source frameworks like Apache Hadoop and Apache Spark, allowing you to focus on extracting insights from large datasets rather than the underlying infrastructure. With Amazon EMR, you can take advantage of the power of these big data tools to process, analyze, and gain valuable business intelligence from vast amounts of data.

Cost optimization is one of the pillars of the AWS Well-Architected Framework. It focuses on avoiding unnecessary costs, selecting the most appropriate resource types, analyzing spend over time, and scaling in and out to meet business needs without overspending. An optimized workload maximizes the use of all available resources, delivers the desired outcome at the most cost-effective price point, and meets your functional needs.

The current Amazon EMR pricing page shows the estimated cost of the cluster. You can also use AWS Cost Explorer to get more detailed information about your costs. These views give you an overall picture of your Amazon EMR costs. However, you may need to attribute costs at the individual Spark job level. For example, you might want to know the usage cost in Amazon EMR for the finance business unit. Or, for chargeback purposes, you might need to aggregate the cost of Spark applications by functional area. After you have allocated costs to individual Spark jobs, this data can help you make informed decisions to optimize your costs. For instance, you could choose to restructure your applications to use fewer resources. Alternatively, you might opt to explore different pricing models like Amazon EMR on EKS or Amazon EMR Serverless.

In this post, we share a chargeback model that you can use to track and allocate the costs of Spark workloads running on Amazon EMR on EC2 clusters. We describe an approach that assigns Amazon EMR costs to different jobs, teams, or lines of business. You can use this approach to distribute costs across business units, which helps you monitor the return on investment for your Spark-based workloads.

Solution overview

The solution is designed to help you track the cost of your Spark applications running on EMR on EC2. It can help you identify cost optimizations and improve the cost-efficiency of your EMR clusters.

The proposed solution uses a scheduled AWS Lambda function that runs daily. The function captures usage and cost metrics, which are then stored in Amazon Relational Database Service (Amazon RDS) tables. The data stored in the RDS tables is queried to derive chargeback figures and generate reporting trends using Amazon QuickSight. Using these AWS services incurs additional costs for implementing this solution. Alternatively, if you want to avoid the additional AWS services and their associated costs, you can consider an approach based on a cron-based agent script installed on your existing EMR cluster. This script stores the relevant metrics in an Amazon Simple Storage Service (Amazon S3) bucket, and uses Python Jupyter notebooks to generate chargeback numbers based on the data files stored in Amazon S3, using AWS Glue tables.

The following diagram shows the current solution architecture.

Attribute Amazon EMR on EC2 costs to your end-users

The workflow consists of the following steps:

  1. A Lambda function gets the following parameters from Parameter Store, a capability of AWS Systems Manager:
    {
      "yarn_url": "http://dummy.compute-1.amazonaws.com:8088/ws/v1/cluster/apps",
      "tbl_applicationlogs_lz": "public.emr_applications_execution_log_lz",
      "tbl_applicationlogs": "public.emr_applications_execution_log",
      "tbl_emrcost": "public.emr_cluster_usage_cost",
      "tbl_emrinstance_usage": "public.emr_cluster_instances_usage",
      "emrcluster_id": "j-xxxxxxxxxx",
      "emrcluster_name": "EMR_Cost_Measure",
      "emrcluster_role": "dt-dna-shared",
      "emrcluster_linkedaccount": "xxxxxxxxxxx",
      "postgres_rds": {
        "host": "xxxxxxxxx.amazonaws.com",
        "dbname": "postgres",
        "user": "postgresadmin",
        "secretid": "postgressecretid"
      }
    }

  2. The Lambda function extracts Spark application run logs from the EMR cluster using the Resource Manager API. The following metrics are extracted as part of the process: vcore-seconds, memory MB-seconds, and storage GB-seconds.
  3. The Lambda function captures the daily cost of EMR clusters from Cost Explorer.
  4. The Lambda function also extracts EMR On-Demand and Spot Instance usage data using the Amazon Elastic Compute Cloud (Amazon EC2) Boto3 APIs.
  5. The Lambda function loads these datasets into an RDS database.
  6. The cost of running a Spark application is determined by the amount of CPU resources it uses, compared to the total CPU usage of all Spark applications. This information is used to distribute the overall cost among different teams, business lines, or EMR queues.
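As a rough illustration of step 2, the following Python sketch flattens a YARN ResourceManager response from the /ws/v1/cluster/apps endpoint into rows shaped like the application log table. The JSON field names (vcoreSeconds, memorySeconds, finalStatus, and so on) are those of the YARN REST API. The function name, the column mapping, and the sample payload are assumptions made for this example and are not taken from the code in the GitHub repo.

```python
from datetime import datetime, timezone


def yarn_apps_to_rows(payload, cluster_id):
    """Flatten a YARN /ws/v1/cluster/apps response into a list of row dicts."""
    # YARN returns {"apps": null} when no application matches the query.
    apps = (payload.get("apps") or {}).get("app", [])
    rows = []
    for app in apps:
        rows.append({
            "app_id": app["id"],
            "app_name": app["name"],
            "queue": app["queue"],
            "job_state": app["state"],
            "job_status": app["finalStatus"],
            # YARN reports timestamps and elapsed time in milliseconds.
            "starttime": datetime.fromtimestamp(app["startedTime"] / 1000, tz=timezone.utc),
            "endtime": datetime.fromtimestamp(app["finishedTime"] / 1000, tz=timezone.utc),
            "runtime_seconds": app["elapsedTime"] // 1000,
            "vcore_seconds": app["vcoreSeconds"],
            "memory_seconds": app["memorySeconds"],
            "running_containers": app["runningContainers"],
            "rm_clusterid": cluster_id,
        })
    return rows


# Hypothetical payload containing a single finished application.
sample_payload = {
    "apps": {
        "app": [{
            "id": "application_00001",
            "name": "HR_PAYROLL_PS_PSPROD_TAXDUDUCTION_DLY_LD",
            "queue": "default",
            "state": "FINISHED",
            "finalStatus": "SUCCEEDED",
            "startedTime": 1700000000000,
            "finishedTime": 1700000010000,
            "elapsedTime": 10000,
            "vcoreSeconds": 120,
            "memorySeconds": 245760,
            "runningContainers": -1,
        }]
    }
}

rows = yarn_apps_to_rows(sample_payload, "j-xxxxxxxxxx")
empty_rows = yarn_apps_to_rows({"apps": None}, "j-xxxxxxxxxx")
```

In a real function the payload would come from an HTTP GET against the yarn_url parameter; the parsing is kept separate here so it can be exercised without a cluster.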

The extraction process runs daily, extracting the previous day's data and storing it in an Amazon RDS for PostgreSQL table. The historical data in the table needs to be purged based on your use case.
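To make the "previous day" window concrete, here is a minimal sketch of how the arguments for the Cost Explorer GetCostAndUsage call could be assembled. The parameter names (TimePeriod, Granularity, Metrics, Filter, GroupBy) belong to the Boto3 get_cost_and_usage API. The helper name, the choice to group by service, and the example tag value are assumptions for illustration and may differ from what the repo's Lambda function does.

```python
from datetime import date, timedelta


def build_daily_cost_request(tag_key, tag_value, run_date):
    """Build get_cost_and_usage arguments covering the day before run_date."""
    yesterday = run_date - timedelta(days=1)
    return {
        # Cost Explorer treats End as exclusive, so this spans exactly one day.
        "TimePeriod": {"Start": yesterday.isoformat(), "End": run_date.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost", "NetUnblendedCost"],
        # Restrict the result to resources carrying the cluster's unique tag.
        "Filter": {"Tags": {"Key": tag_key, "Values": [tag_value]}},
        # One result group per AWS service (for example Amazon EMR and Amazon EC2).
        "GroupBy": [{"Type": "DIMENSION", "Key": "SERVICE"}],
    }


request = build_daily_cost_request("cost-center", "EMRClusterUniqueTagValue", date(2024, 3, 1))

# With AWS credentials and network access in place, the request could be sent as:
# import boto3
# response = boto3.client("ce").get_cost_and_usage(**request)
```

Building the request as a plain dictionary keeps the date arithmetic testable without calling AWS.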

The solution is open source and available on GitHub.

You can use the AWS Cloud Development Kit (AWS CDK) to deploy the Lambda function, RDS for PostgreSQL data model tables, and a QuickSight dashboard to track EMR cluster cost at the job, team, or business unit level.

The following schema shows the tables used in the solution, which are queried by QuickSight to populate the dashboard.

  • emr_applications_execution_log_lz or public.emr_applications_execution_log – Storage for daily run metrics for all jobs run on the EMR cluster:
    • appdatecollect – Log collection date
    • app_id – Spark job run ID
    • app_name – Run name
    • queue – EMR queue in which the job was run
    • job_state – Job running state
    • job_status – Job run final status (Succeeded or Failed)
    • starttime – Job start time
    • endtime – Job end time
    • runtime_seconds – Runtime in seconds
    • vcore_seconds – Consumed vCore CPU in seconds
    • memory_seconds – Memory consumed
    • running_containers – Containers used
    • rm_clusterid – EMR cluster ID
  • emr_cluster_usage_cost – Captures Amazon EMR and Amazon EC2 daily cost consumption from Cost Explorer and loads the data into the RDS table:
    • costdatecollect – Cost collection date
    • startdate – Cost start date
    • enddate – Cost end date
    • emr_unique_tag – EMR cluster associated tag
    • net_unblendedcost – Total net unblended daily dollar cost
    • unblendedcost – Total unblended daily dollar cost
    • cost_type – Daily cost
    • service_name – AWS service for which the cost was incurred (Amazon EMR and Amazon EC2)
    • emr_clusterid – EMR cluster ID
    • emr_clustername – EMR cluster name
    • loadtime – Table load date/time
  • emr_cluster_instances_usage – Captures the aggregated resource usage (vCores) and allocated resources for each EMR cluster node, and helps identify the idle time of the cluster:
    • instancedatecollect – Instance usage collection date
    • emr_instance_day_run_seconds – EMR instance active seconds in the day
    • emr_region – EMR cluster AWS Region
    • emr_clusterid – EMR cluster ID
    • emr_clustername – EMR cluster name
    • emr_cluster_fleet_type – EMR cluster fleet type
    • emr_node_type – Instance node type
    • emr_market – Market type (on-demand or provisioned)
    • emr_instance_type – Instance size
    • emr_ec2_instance_id – Corresponding EC2 instance ID
    • emr_ec2_status – Running status
    • emr_ec2_default_vcpus – Allocated vCPU
    • emr_ec2_memory – EC2 instance memory
    • emr_ec2_creation_datetime – EC2 instance creation date/time
    • emr_ec2_end_datetime – EC2 instance end date/time
    • emr_ec2_ready_datetime – EC2 instance ready date/time
    • loadtime – Table load date/time

Prerequisites

You must have the following prerequisites before implementing the solution:

  • An EMR on EC2 cluster.
  • The EMR cluster must have a unique tag value defined. You can assign the tag directly on the Amazon EMR console or using Tag Editor. The recommended tag key is cost-center along with a unique value for your EMR cluster. After you create and apply user-defined tags, it can take up to 24 hours for the tag keys to appear on your cost allocation tags page for activation.
  • Activate the tag in AWS Billing. It takes about 24 hours to activate the tag if not done before. To activate the tag, follow these steps:
    • On the AWS Billing and Cost Management console, choose Cost allocation tags from the navigation pane.
    • Select the tag key that you want to activate.
    • Choose Activate.
  • The Spark application's name should follow the standardized naming convention. It consists of seven components separated by underscores: <business_unit>_<program>_<application>_<source>_<job_name>_<frequency>_<job_type>. These components are used to summarize the resource consumption and cost in the final report. For example: HR_PAYROLL_PS_PSPROD_TAXDUDUCTION_DLY_LD, FIN_CASHRECEIPT_GL_GLDB_MAIN_DLY_LD, or MKT_CAMPAIGN_CRM_CRMDB_TOPRATEDCAMPAIGN_DLY_LD. The application name must be supplied with the spark-submit command using the --name parameter with the standardized naming convention. If any of these components don't have a value, hardcode the values with the following suggested names:
    • frequency
    • job_type
    • Business_unit
  • The Lambda function should be able to connect to Cost Explorer, connect to the EMR cluster through the Resource Manager APIs, and load data into the RDS for PostgreSQL database. To do this, you need to configure the Lambda function as follows:
    • VPC configuration – The Lambda function should be able to access the EMR cluster, Cost Explorer, AWS Secrets Manager, and Parameter Store. If access is not in place already, you can set it up by creating a virtual private cloud (VPC) that includes the EMR cluster, creating VPC endpoints for Parameter Store and Secrets Manager, and attaching them to the VPC. Because there is no VPC endpoint available for Cost Explorer, a private subnet and a route table are required to send VPC traffic to a public NAT gateway so that Lambda can connect to Cost Explorer. If your EMR cluster is in a public subnet, you must create a private subnet along with a custom route table and a public NAT gateway, which allows the Cost Explorer connection to flow from the VPC private subnet. Refer to How do I set up a NAT gateway for a private subnet in Amazon VPC? for setup instructions and attach the newly created private subnet to the Lambda function explicitly.
    • IAM role – The Lambda function needs to have an AWS Identity and Access Management (IAM) role with the following permissions: AmazonEC2ReadOnlyAccess, AWSCostExplorerFullAccess, and AmazonRDSDataFullAccess. This role is created automatically during AWS CDK stack deployment; you don't need to set it up separately.
  • The AWS CDK should be installed on AWS Cloud9 (preferred) or another development environment such as VSCode or PyCharm. For more information, refer to Prerequisites.
  • The RDS for PostgreSQL database (v10 or higher) credentials should be stored in Secrets Manager. For more information, refer to Storing database credentials in AWS Secrets Manager.
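Because the final report depends on the seven-part application naming convention described in the prerequisites, it can help to validate names before they reach the cluster. The following is a minimal sketch of such a check. The JobName type and parse_job_name helper are hypothetical names introduced for this example and are not part of the published solution.

```python
from typing import NamedTuple


class JobName(NamedTuple):
    """The seven components of the standardized Spark application name."""
    business_unit: str
    program: str
    application: str
    source: str
    job_name: str
    frequency: str
    job_type: str


def parse_job_name(app_name):
    """Split an application name into its seven underscore-separated parts."""
    parts = app_name.split("_")
    if len(parts) != 7:
        raise ValueError(f"expected 7 components, got {len(parts)}: {app_name!r}")
    return JobName(*parts)


parsed = parse_job_name("FIN_CASHRECEIPT_GL_GLDB_MAIN_DLY_LD")
print(parsed.business_unit, parsed.frequency)  # prints "FIN DLY"
```

Note that a component cannot itself contain an underscore under this scheme, which is why the sketch rejects any name that does not split into exactly seven parts.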

Create RDS tables

Create the data model tables defined in emr-cost-rds-tables-ddl.sql by logging in to the RDS for PostgreSQL instance manually and running the DDL in the public schema.

Use DBeaver or any compatible SQL client to connect to the RDS instance and validate that the tables have been created.

Deploy AWS CDK stacks

Complete the steps in this section to deploy the following resources using the AWS CDK:

  • Parameter Store to store required parameter values
  • IAM role for the Lambda function to help connect to Amazon EMR and underlying EC2 instances, Cost Explorer, CloudWatch, and Parameter Store
  • Lambda function
  1. Clone the GitHub repo:
    git clone git@github.com:aws-samples/attribute-amazon-emr-costs-to-your-end-users.git

  2. Update the following environment parameters in cdk.context.json (this file can be found in the main directory):
    1. yarn_url – YARN ResourceManager URL to read job run logs and metrics. This URL should be accessible within the VPC where the Lambda function is deployed.
    2. tbl_applicationlogs_lz – RDS temp table to store EMR application run logs.
    3. tbl_applicationlogs – RDS table to store EMR application run logs.
    4. tbl_emrcost – RDS table to capture daily EMR cluster usage cost.
    5. tbl_emrinstance_usage – RDS table to store EMR cluster instance usage data.
    6. emrcluster_id – EMR cluster instance ID.
    7. emrcluster_name – EMR cluster name.
    8. emrcluster_tag – Tag key assigned to the EMR cluster.
    9. emrcluster_tag_value – Unique value for the EMR cluster tag.
    10. emrcluster_role – Service role for Amazon EMR (EMR role).
    11. emrcluster_linkedaccount – Account ID under which the EMR cluster is running.
    12. postgres_rds – RDS for PostgreSQL connection details.
    13. vpc_id – ID of the VPC in which the EMR cluster is configured and the cost metering Lambda function is deployed.
    14. vpc_subnets – Comma-separated private subnet IDs associated with the VPC.
    15. sg_id – EMR security group ID.

The following is a sample cdk.context.json file after being populated with the parameters:

{
  "yarn_url": "http://dummy.compute-1.amazonaws.com:8088/ws/v1/cluster/apps",
  "tbl_applicationlogs_lz": "public.emr_applications_execution_log_lz",
  "tbl_applicationlogs": "public.emr_applications_execution_log",
  "tbl_emrcost": "public.emr_cluster_usage_cost",
  "tbl_emrinstance_usage": "public.emr_cluster_instances_usage",
  "emrcluster_id": "j-xxxxxxxxxx",
  "emrcluster_name": "EMRClusterName",
  "emrcluster_tag": "EMRClusterTag",
  "emrcluster_tag_value": "EMRClusterUniqueTagValue",
  "emrcluster_role": "EMRClusterServiceRole",
  "emrcluster_linkedaccount": "xxxxxxxxxxx",
  "postgres_rds": {
    "host": "xxxxxxxxx.amazonaws.com",
    "dbname": "dbname",
    "user": "username",
    "secretid": "DatabaseUserSecretID"
  },
  "vpc_id": "xxxxxxxxx",
  "vpc_subnets": "subnet-xxxxxxxxxxx",
  "sg_id": "xxxxxxxxxx"
}
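A missing or empty parameter in cdk.context.json tends to surface only at deploy or run time, so a quick pre-flight check can save a round trip. The sketch below checks only the top-level keys listed in the parameter walkthrough. The helper name and the inline stand-in document are assumptions for this example; in practice you would load the real cdk.context.json file.

```python
import json

# Top-level keys listed in the parameter walkthrough above.
REQUIRED_KEYS = {
    "yarn_url", "tbl_applicationlogs_lz", "tbl_applicationlogs", "tbl_emrcost",
    "tbl_emrinstance_usage", "emrcluster_id", "emrcluster_name", "emrcluster_tag",
    "emrcluster_tag_value", "emrcluster_role", "emrcluster_linkedaccount",
    "postgres_rds", "vpc_id", "vpc_subnets", "sg_id",
}


def missing_context_keys(context):
    """Return the required keys that are absent or empty, in sorted order."""
    return sorted(key for key in REQUIRED_KEYS if not context.get(key))


# Inline stand-in for the file contents, deliberately incomplete.
incomplete = json.loads(
    '{"yarn_url": "http://dummy.compute-1.amazonaws.com:8088/ws/v1/cluster/apps",'
    ' "vpc_id": ""}'
)
missing = missing_context_keys(incomplete)
print(f"{len(missing)} parameters still need a value")
```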

You can choose to deploy the AWS CDK stack using AWS Cloud9 or any other development environment according to your needs. For instructions to set up AWS Cloud9, refer to Getting started: basic tutorials for AWS Cloud9.

  1. Go to AWS Cloud9 and choose File and Upload local files to upload the project folder.
  2. Deploy the AWS CDK stack with the following code:
    cd attribute-amazon-emr-costs-to-your-end-users/
    pip install -r requirements.txt
    cdk deploy --all

The deployed Lambda function requires two external libraries: psycopg2 and requests. The corresponding layer needs to be created and assigned to the Lambda function. For instructions to create a Lambda layer for the requests module, refer to Step-by-Step Guide to Creating an AWS Lambda Function Layer.

Creation of the psycopg2 package and layer is tied to the Python runtime version of the Lambda function. Provided that the Lambda function uses the Python 3.9 runtime, complete the following steps to create the corresponding layer package for psycopg2:

  1. Download psycopg2_binary-2.9.9-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl from https://pypi.org/project/psycopg2-binary/#files.
  2. Unzip the wheel, move the contents to a directory named python, and zip that directory:
    zip -r python.zip python/

  3. Create a Lambda layer for psycopg2 using the zip file.
  4. Assign the layer to the Lambda function by choosing Add a layer in the deployed function properties.
  5. Validate the AWS CDK deployment.
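The repackaging in step 2 can also be scripted. The following sketch uses only the Python standard library to copy a wheel's contents under a top-level python/ directory, which is the layout Lambda expects for Python layers. The function name and output file name are assumptions, and the demo uses a tiny stand-in wheel so it runs without downloading anything; point it at the real psycopg2-binary wheel in practice.

```python
import tempfile
import zipfile
from pathlib import Path


def build_layer_zip(wheel_path, layer_zip_path):
    """Repackage a wheel's contents under python/ so Lambda can import them."""
    with zipfile.ZipFile(wheel_path) as wheel, \
            zipfile.ZipFile(layer_zip_path, "w", zipfile.ZIP_DEFLATED) as layer:
        for member in wheel.infolist():
            if member.is_dir():
                continue
            # Layer content is extracted to /opt, and /opt/python is on sys.path.
            layer.writestr(f"python/{member.filename}", wheel.read(member))
    return Path(layer_zip_path)


# Demo with a stand-in wheel (a wheel is an ordinary zip archive).
workdir = Path(tempfile.mkdtemp())
fake_wheel = workdir / "psycopg2_binary-demo.whl"
with zipfile.ZipFile(fake_wheel, "w") as whl:
    whl.writestr("psycopg2/__init__.py", "# placeholder module\n")

layer_zip = build_layer_zip(fake_wheel, workdir / "psycopg2-layer.zip")
with zipfile.ZipFile(layer_zip) as archive:
    layer_names = archive.namelist()
print(layer_names)  # ['python/psycopg2/__init__.py']
```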

Your Lambda function details should look similar to the following screenshot.

Lambda Function Screenshot

On the Systems Manager console, validate the Parameter Store content for actual values.

The IAM role details should look similar to the following code, which allows the Lambda function access to Amazon EMR and underlying EC2 instances, Cost Explorer, CloudWatch, Secrets Manager, and Parameter Store:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ce:GetCostAndUsage",
        "ce:ListCostAllocationTags",
        "ec2:AttachNetworkInterface",
        "ec2:CreateNetworkInterface",
        "ec2:DeleteNetworkInterface",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeInstances",
        "ec2:DescribeNetworkInterfaces",
        "elasticmapreduce:Describe*",
        "elasticmapreduce:List*",
        "ssm:Describe*",
        "ssm:Get*",
        "ssm:List*"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:DescribeLogStreams",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*",
      "Effect": "Allow"
    },
    {
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:*:*:*",
      "Effect": "Allow"
    }
  ]
}

Test the solution

To test the solution, you can run a Spark job that combines multiple files in the EMR cluster. You can do this by creating separate steps within the cluster. Refer to Optimize Amazon EMR costs for legacy and Spark workloads for more details on how to add the jobs as steps to the EMR cluster.

  1. Use the following sample command to submit the Spark job (emr_union_job.py).
    It takes in three arguments:
    1. <input_full_path> – The Amazon S3 location of the data file that is read in by the Spark job. The path should not be changed. The input_full_path is s3://aws-blogs-artifacts-public/artifacts/BDB-2997/sample-data/input/part-00000-a0885743-e0cb-48b1-bc2b-05eb748ab898-c000.snappy.parquet
    2. <output_path> – The S3 folder where the results are written to.
    3. <number of copies to be unioned> – By changing this input to the Spark job, you can make the job run for different amounts of time and also change the number of Spot nodes used.
spark-submit --deploy-mode cluster --name HR_PAYROLL_PS_PSPROD_TAXDUDUCTION_DLY_LD s3://aws-blogs-artifacts-public/artifacts/BDB-2997/scripts/emr_union_job.py s3://aws-blogs-artifacts-public/artifacts/BDB-2997/sample-data/input/part-00000-a0885743-e0cb-48b1-bc2b-05eb748ab898-c000.snappy.parquet s3://<output_bucket>/<output_path>/ 6

spark-submit --deploy-mode cluster --name FIN_CASHRECEIPT_GL_GLDB_MAIN_DLY_LD s3://aws-blogs-artifacts-public/artifacts/BDB-2997/scripts/emr_union_job.py s3://aws-blogs-artifacts-public/artifacts/BDB-2997/sample-data/input/part-00000-a0885743-e0cb-48b1-bc2b-05eb748ab898-c000.snappy.parquet s3://<output_bucket>/<output_path>/ 12

The following screenshot shows the log of the steps run on the Amazon EMR console.

EMR Steps Execution

  2. Run the deployed Lambda function from the Lambda console. This loads the daily application log, EMR dollar usage, and EMR instance usage details into their respective RDS tables.

The following screenshot of the Amazon RDS query editor shows the results for public.emr_applications_execution_log.

public.emr_applications_execution_log

The following screenshot shows the results for public.emr_cluster_usage_cost.

public.emr_cluster_usage_cost

The following screenshot shows the results for public.emr_cluster_instances_usage.

public.emr_cluster_instances_usage

Cost can be calculated using the preceding three tables based on your requirements. In the following SQL query, you calculate the cost based on the relative usage of all applications in a day. You first identify the total vcore-seconds CPU consumed in a day and then find the percentage share of each application. This share drives each application's cost, based on the overall cluster cost for the day.

Consider the following example scenario, where 10 applications ran on the cluster for a given day. You would use the following sequence of steps to calculate the chargeback cost:

  1. Calculate the relative percentage usage of each application (vcore-seconds CPU consumed by the application / total vcore-seconds CPU consumed).
  2. Now that you have the relative resource consumption of each application, distribute the cluster cost to each application. Let's assume that the total EMR cluster cost for that date is $400.
    app_id             app_name  runtime_seconds  vcore_seconds  % Relative Usage  Amazon EMR Cost ($)
    application_00001  app1      10               120            5%                19.83
    application_00002  app2      5                60             2%                9.91
    application_00003  app3      4                45             2%                7.43
    application_00004  app4      70               840            35%               138.79
    application_00005  app5      21               300            12%               49.57
    application_00006  app6      4                48             2%                7.93
    application_00007  app7      12               150            6%                24.78
    application_00008  app8      52               620            26%               102.44
    application_00009  app9      12               130            5%                21.48
    application_00010  app10     9                108            4%                17.84
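The arithmetic behind this table can be reproduced in a few lines of Python. This is a minimal sketch of the proportional allocation described in the two steps above, not the SQL query from the GitHub repo. The function name is an assumption; the vcore-seconds values and the $400 daily cost are the ones from the example.

```python
def allocate_cost(vcore_seconds_by_app, cluster_cost):
    """Split cluster_cost across apps in proportion to their vcore-seconds."""
    total = sum(vcore_seconds_by_app.values())
    if total == 0:
        # Nothing ran, so there is no usage to attribute the cost to.
        return {app: 0.0 for app in vcore_seconds_by_app}
    return {
        app: round(cluster_cost * used / total, 2)
        for app, used in vcore_seconds_by_app.items()
    }


usage = {
    "app1": 120, "app2": 60, "app3": 45, "app4": 840, "app5": 300,
    "app6": 48, "app7": 150, "app8": 620, "app9": 130, "app10": 108,
}
costs = allocate_cost(usage, 400.0)
print(costs["app1"], costs["app4"])  # prints "19.83 138.79"
```

The total consumption is 2,421 vcore-seconds, so app1 is charged 400 × 120 / 2421 ≈ $19.83. Because each share is rounded independently, the per-application amounts can differ from the cluster total by a cent or two on other inputs; in this example they add up to $400.00.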

A sample chargeback cost calculation SQL query is available on the GitHub repo.

You can use the SQL query to create a report dashboard to plot multiple charts for the insights. The following are two examples created using QuickSight.

The following is a daily bar chart.

Cost Daily Bar Chart

The following shows total dollars consumed.

Cost Pie chart

Solution cost

Let's assume we're calculating for an environment that runs 1,000 jobs daily, and we run this solution daily:

  • Lambda costs – One run per day requires 30 Lambda function invocations per month.
  • Amazon RDS cost – The total number of records in the public.emr_applications_execution_log table for a 30-day month would be 30,000 records, which translates to 5.72 MB of storage. If we consider the other two smaller tables and storage overhead, the overall monthly storage requirement would be approximately 12 MB.

In summary, the solution cost according to the AWS Pricing Calculator is $34.20/year, which is negligible.

Clean up

To avoid ongoing charges for the resources that you created, complete the following steps:

  • Delete the AWS CDK stacks:
    cdk destroy --all
  • Delete the QuickSight report and dashboard, if created.
  • Run the next SQL to drop the tables:
    drop table public.emr_applications_execution_log_lz;
    drop table public.emr_applications_execution_log;
    drop table public.emr_cluster_usage_cost;
    drop table public.emr_cluster_instances_usage;

Conclusion

With this solution, you can deploy a chargeback model to attribute costs to users and groups using the EMR cluster. You can also identify options for optimization, scaling, and separation of workloads to different clusters based on usage and growth needs.

You can collect the metrics for a longer duration to observe trends in the usage of Amazon EMR resources and use that for forecasting purposes.

If you have any thoughts or questions, leave them in the comments section.


About the Authors

Raj Patel is an AWS Lead Consultant for Data Analytics solutions based out of India. He specializes in building and modernising analytical solutions. His background is in data warehouse and data lake architecture, development, and administration. He has been in the data and analytics field for over 14 years.

Ramesh Raghupathy is a Senior Data Architect with WWCO ProServe at AWS. He works with AWS customers to architect, deploy, and migrate to data warehouses and data lakes on the AWS Cloud. While not at work, Ramesh enjoys traveling, spending time with family, and yoga.

Gaurav Jain is a Sr Data Architect with AWS Professional Services, specialized in big data, and helps customers modernize their data platforms on the cloud. He is passionate about building the right analytics solutions to gain timely insights and make critical business decisions. Outside of work, he likes to spend time with his family and likes watching movies and sports.

Dipal Mahajan is a Lead Consultant with Amazon Web Services based out of India, where he guides global customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings extensive experience in software development, architecture, and analytics from industries like finance, telecom, retail, and healthcare.
