Sample code to collect Apache Iceberg metrics for table monitoring
MIT-0 License
This repository provides a sample solution that collects metrics from existing Apache Iceberg tables stored in Amazon S3 and cataloged in the AWS Glue Data Catalog. The solution consists of an AWS Lambda deployment package that collects metrics and submits them to Amazon CloudWatch. The repository also includes a helper script for deploying a CloudWatch monitoring dashboard to visualize the collected metrics.
The solution uses the pyiceberg library and AWS Glue interactive sessions with minimal compute to read the snapshots, partitions, and files Apache Iceberg metadata tables with Apache Spark.
The following groups of metrics are collected:
- Snapshot metrics
- Partitions aggregated metrics
- Per-partition metrics
- Files aggregated metrics
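As an illustration of how metadata values could become CloudWatch metrics, the sketch below converts the numeric entries of an Iceberg snapshot summary into the `MetricData` structure accepted by the CloudWatch `put_metric_data` API. The function name, the `snapshot.` metric prefix, and the `table_name` dimension are assumptions made for this example; the metric names actually emitted by this repository's Lambda may differ.

```python
from typing import Any, Dict, List


def build_metric_data(table: str, summary: Dict[str, str],
                      prefix: str = "snapshot") -> List[Dict[str, Any]]:
    """Turn numeric snapshot-summary entries into CloudWatch MetricData items.

    The prefix and the dimension name are hypothetical naming choices.
    """
    metric_data = []
    for key, raw_value in sorted(summary.items()):
        try:
            value = float(raw_value)
        except (TypeError, ValueError):
            continue  # skip non-numeric entries such as "operation"
        metric_data.append({
            "MetricName": f"{prefix}.{key}",
            "Dimensions": [{"Name": "table_name", "Value": table}],
            "Value": value,
            "Unit": "Count",
        })
    return metric_data


# Example snapshot summary; Iceberg stores summary values as strings.
summary = {"operation": "append", "added-data-files": "4", "total-records": "1200"}
metrics = build_metric_data("db.events", summary)

# With AWS credentials configured, the payload could be submitted like this:
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="<<CWNamespace>>", MetricData=metrics)
```

Building the payload separately from the API call keeps the conversion logic easy to test without AWS access.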
This solution uses Docker as a dependency of the AWS SAM CLI. To install Docker, follow the official Docker documentation: https://docs.docker.com/get-docker/
This solution uses the AWS SAM CLI to build, test, and deploy the AWS Lambda code that collects the Iceberg table metrics and submits them to Amazon CloudWatch.
To install the AWS SAM CLI, follow the AWS documentation: https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-sam-cli.html
Important: The guidance below uses the AWS Serverless Application Model (SAM) for easier packaging and deployment of AWS Lambda. However, if you use your own packaging tool or want to deploy the AWS Lambda function manually, you can explore the following files:
- template.yaml
- lambda/requirements.txt
- lambda/app.py
Once you've installed Docker and the SAM CLI, you are ready to build the AWS Lambda function. Open your terminal and run the command below.
sam build --use-container
Once the build is finished, you can deploy your AWS Lambda function. SAM uploads the packaged code and deploys the AWS Lambda resource using AWS CloudFormation. Run the command below in your terminal.
sam deploy --guided
The guided deployment prompts for the following parameters:
- CWNamespace - A namespace is a container for CloudWatch metrics.
- GlueServiceRole - The ARN of the AWS Glue role you created earlier.
- Warehouse - Required catalog property that determines the root path of the data warehouse on S3. This can be any path in your S3 bucket. Not critical for the solution.
In this section you will configure an EventBridge rule that triggers the Lambda function on every transaction commit to an Apache Iceberg table.
The default rule listens to the Glue Data Catalog Table State Change event from all tables in the Glue Data Catalog. The Lambda code skips non-Iceberg tables.
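To make the filtering step concrete, here is a minimal sketch of how a handler could extract the database and table name from the event and decide whether a table is an Iceberg table. It assumes the event `detail` carries `databaseName` and `tableName`, and that Iceberg tables registered in the Glue Data Catalog carry the table parameter `table_type` set to `ICEBERG`. The function names are hypothetical, and the check in `lambda/app.py` may be implemented differently.

```python
from typing import Optional, Tuple


def table_from_event(event: dict) -> Optional[Tuple[str, str]]:
    """Return (database, table) for a Glue table state change event, else None."""
    if event.get("source") != "aws.glue":
        return None
    if event.get("detail-type") != "Glue Data Catalog Table State Change":
        return None
    detail = event.get("detail") or {}
    database = detail.get("databaseName")
    table = detail.get("tableName")
    if not database or not table:
        return None
    return database, table


def is_iceberg_table(glue_table: dict) -> bool:
    """Check the 'Table' dict of a Glue get_table response for the Iceberg marker."""
    params = glue_table.get("Parameters") or {}
    return params.get("table_type", "").upper() == "ICEBERG"


# Hand-written sample inputs for illustration only.
sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Data Catalog Table State Change",
    "detail": {"databaseName": "sales", "tableName": "orders"},
}
iceberg_table = {"Name": "orders", "Parameters": {"table_type": "ICEBERG"}}
hive_table = {"Name": "legacy", "Parameters": {"classification": "parquet"}}

print(table_from_event(sample_event))
print(is_iceberg_table(iceberg_table), is_iceberg_table(hive_table))
```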
If you want to scope the trigger to specific Iceberg tables instead of collecting metrics from all of them, uncomment glue_table_names = ["<<REPLACE TABLE 1>>", "<<REPLACE TABLE 1>>"] and add the relevant table names.
import boto3
import json

# Initialize boto3 clients
session = boto3.Session(region_name='<<SET CORRECT AWS REGION>>')
lambda_client = session.client('lambda')
events_client = session.client('events')

# Parameters
lambda_function_arn = '<<REPLACE WITH LAMBDA FUNCTION ARN>>'
glue_table_names = None
# glue_table_names = ["<<REPLACE TABLE 1>>", "<<REPLACE TABLE 1>>"]

# Build the event pattern for the EventBridge rule
event_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Data Catalog Table State Change"]
}

# Optionally restrict the rule to specific tables
if glue_table_names:
    event_pattern["detail"] = {
        "tableName": glue_table_names
    }

event_pattern_dump = json.dumps(event_pattern)

# Create (or update) the EventBridge rule
rule_response = events_client.put_rule(
    Name='IcebergTablesUpdateRule',
    EventPattern=event_pattern_dump,
    State='ENABLED'
)

# Add Lambda as a target to the EventBridge rule
events_client.put_targets(
    Rule='IcebergTablesUpdateRule',
    Targets=[
        {
            'Id': '1',
            'Arn': lambda_function_arn
        }
    ]
)

print(f"Pattern updated = {event_pattern_dump}")
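The script above creates `lambda_client` but does not use it. EventBridge can only invoke a Lambda function whose resource policy allows it, so if the deployed template does not already grant that permission, a call along the following lines may be needed. This is a sketch under that assumption: the helper name and the statement ID are hypothetical, while `add_permission` and its parameters are part of the boto3 Lambda client.

```python
def allow_eventbridge_invoke(lambda_client, function_arn: str, rule_arn: str,
                             statement_id: str = "IcebergTablesUpdateRuleInvoke") -> dict:
    """Grant the given EventBridge rule permission to invoke the Lambda function.

    lambda_client is expected to be a boto3 Lambda client.
    The default statement_id is a hypothetical name chosen for this example.
    """
    return lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=statement_id,
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )


# Usage with the variables from the script above (requires AWS credentials):
#   allow_eventbridge_invoke(lambda_client, lambda_function_arn,
#                            rule_response["RuleArn"])
```

Calling `add_permission` twice with the same statement ID raises a conflict error, so run it once per rule or remove the old statement first.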
Once your Iceberg table metrics are submitted to CloudWatch, you can use them for monitoring and for creating alarms. CloudWatch also lets you visualize metrics using CloudWatch dashboards.
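As one possible way to create an alarm, the sketch below builds the keyword arguments for the CloudWatch `put_metric_alarm` API. The helper name, the alarm naming scheme, the example metric name, and the `table_name` dimension are all assumptions; replace them with the metric names and dimensions that the Lambda actually publishes in your namespace.

```python
from typing import Any, Dict


def build_alarm_params(namespace: str, metric_name: str,
                       table_name: str, threshold: float) -> Dict[str, Any]:
    """Build put_metric_alarm arguments for a simple upper-bound alarm."""
    return {
        "AlarmName": f"{table_name}-{metric_name}-high",  # hypothetical naming scheme
        "Namespace": namespace,
        "MetricName": metric_name,
        # Assumed dimension; it must match the dimensions on the published metric.
        "Dimensions": [{"Name": "table_name", "Value": table_name}],
        "Statistic": "Maximum",
        "Period": 300,              # evaluate over 5-minute periods
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no commits means no data points
    }


params = build_alarm_params("<<CWNamespace>>", "<<METRIC NAME>>", "orders", 1000)

# With AWS credentials configured, the alarm could be created like this:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Treating missing data as not breaching is a design choice for tables that are written infrequently, since metrics only arrive when a commit triggers the Lambda.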
assets/cloudwatch-dashboard.template.json is a sample CloudWatch dashboard configuration that uses a fraction of the submitted metrics and combines them with AWS Glue native metrics for Apache Iceberg.
The template uses Jinja2, so you can generate your own dashboard by providing your parameters.
Run the script below to generate your own CloudWatch dashboard configuration. Replace the input values with the relevant parameters from the previous sections.
import json
from jinja2 import Template


def render_json_template(template_path, data):
    """Render the Jinja2 template and parse the result as JSON."""
    with open(template_path, 'r') as file:
        template_text = file.read()
    template = Template(template_text)
    rendered_json = template.render(data)
    json_data = json.loads(rendered_json)
    return json_data


# Data to fill in the template
data = {
    "CW_NAMESPACE": "<<REPLACE>>",
    "REGION": "<<REPLACE>>",
    "DBNAME": "<<REPLACE>>",
    "TABLENAME": "<<REPLACE>>"
}

# Path to the CloudWatch dashboard template file
template_path = 'assets/cloudwatch-dashboard.template.json'
rendered_data = render_json_template(template_path, data)

# Write the rendered dashboard configuration
output_path = 'assets/cloudwatch-dashboard.rendered.json'
with open(output_path, 'w') as file:
    json.dump(rendered_data, file, indent=4)

print(f"Your dashboard configuration successfully generated at {output_path}")
Now follow the steps to create a CloudWatch dashboard from the rendered JSON.
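If you prefer to create the dashboard programmatically instead of through the console, a minimal sketch using the CloudWatch `put_dashboard` API could look like the following. The helper name and the dashboard name in the usage comment are hypothetical; the rendered file path is the one produced by the script above.

```python
import json


def publish_dashboard(cloudwatch_client, name: str, rendered_path: str) -> dict:
    """Create or update a CloudWatch dashboard from a rendered JSON file.

    cloudwatch_client is expected to be a boto3 CloudWatch client.
    """
    with open(rendered_path, "r") as file:
        body = json.load(file)  # fails early if the file is not valid JSON
    return cloudwatch_client.put_dashboard(
        DashboardName=name,
        DashboardBody=json.dumps(body),
    )


# Usage (requires AWS credentials); the dashboard name is a placeholder:
#   import boto3
#   client = boto3.client("cloudwatch", region_name="<<SET CORRECT AWS REGION>>")
#   response = publish_dashboard(client, "<<DASHBOARD NAME>>",
#                                "assets/cloudwatch-dashboard.rendered.json")
#   print(response.get("DashboardValidationMessages"))
```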
You can test the code locally using the SAM CLI. Ensure you have configured the right AWS permissions to call CloudWatch and AWS Glue.
sam local invoke IcebergMetricsLambda --env-vars .env.local.json
.env.local.json - The JSON file that contains values for the Lambda function's environment variables. The Lambda code depends on the environment variables that you pass in the deploy section. You need to create the file and include the relevant parameters before calling sam local invoke.
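The file passed to `--env-vars` is a JSON object keyed by the function's logical ID, with the environment variables as a nested object. The sketch below writes such a file. The helper name is hypothetical and the variable names in the usage comment are placeholders, since the exact environment variable names are defined in template.yaml and are not listed here.

```python
import json
from typing import Dict


def write_env_file(path: str, function_logical_id: str,
                   variables: Dict[str, str]) -> dict:
    """Write an env-vars file for `sam local invoke` and return its content."""
    payload = {function_logical_id: dict(variables)}
    with open(path, "w") as file:
        json.dump(payload, file, indent=4)
    return payload


# Usage; replace the placeholder names with the variables from template.yaml:
#   write_env_file(".env.local.json", "IcebergMetricsLambda", {
#       "<<ENV VAR NAME>>": "<<VALUE>>",
#   })
```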
PyIceberg is a Python implementation for accessing Iceberg tables, without the need of a JVM. https://py.iceberg.apache.org
AWS Serverless Application Model (AWS SAM) https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html
Docker https://docs.docker.com/get-docker/
To clean up, delete the deployed resources with sam delete.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.