This repository contains reference architecture, code sample and dashboard template for tracking Azure resources availability (uptime/downtime) trends.
This solution implements the pillars of the Microsoft Azure Well-Architected Framework, which is a set of guiding tenets that you can use to improve the quality of a workload. One of the key considerations of the solution was the reliability pillar, which ensures that your application can meet the commitments you make to your customers. To support this, this solution helps to ensure that Azure applications are consistently reliable and meet the expectations of customers. In addition, this solution considers the operational excellence pillar, ensuring that processes can keep an application running in production. Together, these pillars come together to ensure that applications remain consistently available and reliable customers. For more information about these framework pillars, see Overview of the reliability pillar and Overview of the operational excellence pillar.
This solultion allows you to track and filter resources across different tenants and subscriptions. Follow the steps in the post-installation section to set this up.
The frequency of availability data pulled can be adjusted down to one minute, or any other desired interval, by updating the MyTimeTrigger environment variable. Follow the steps in the post installation section to modify this value.
You can directly navigate to a resource's overview page in the Azure Portal by clicking on the underlined id field in the drill down menu.
The following diagram gives a high-level view of Observability solution. You may download the Visio file from here
Unlike Azure Monitor, which provides the average availability of one resource at a time, this solution provides the average availability of all resources of the same resource type in your subscriptions. For example, instead of providing the availability of one Key Vault, this solution will provide the average availability of all Key Vaults in your subscriptions.
The above diagram consists of a range of Azure components, which will be further outlined below.
Azure Data Explorer Clusters End-to-end solution for data ingestion, query, visualization, and management. Also used as the time series database for the availability metrics
Resource Graph Explorer Enables running Resource Graph queries directly in the Azure portal.
Service Bus Decouples applications and services from each other, to allow for load-balancing and safe data transfer.
Ingest Function Loads data records from one or more sources into a table in Azure Data Explorer. Once ingested, the data becomes available for query.
Grafana Azure managed Grafana to visualize the availability metrics
Azure Blob Object storage solution for the cloud. Optimized for storing massive amounts of unstructured data.
Key Vault Cloud-based service that allows you to securely store and manage cryptographic keys, secrets, and certificates used by your applications and services.
This solution calls on the Azure Monitor Batch API to pull availability data for multiple resources within a subscription in one call.
A sample request will look like this
POST "https://{region}.metrics.monitor.azure.com/subscriptions/{subscriptionID}/metrics:getBatch?timespan={timeSpan}&interval=PT1M&metricnames=Availability&aggregation=average&metricNamespace={resourceProvider}&autoadjusttimegrain=true&api-version=2023-03-01-preview"
With multiple resource IDs passed in the body of the request
{
"resourceids": [
"/subscriptions/12345678-abcd-1234-abcd-123456789abc/resourceGroups/TestGroup/providers/Microsoft.Storage/storageAccounts/TestStorage1",
"/subscriptions/12345678-abcd-1234-abcd-123456789abc/resourceGroups/TestGroup/providers/Microsoft.Storage/storageAccounts/TestStorage2"
]
}
The recommended SKU for this Kusto cluster is Standard_E8ads_v5. You can monitor this and scale up as needed by checking the application insights for the TimerStartPipelineFunction. You may see some 429 Kusto errors, meaning that your requests are being rate limited. The requests will wait some and be retried, so you should not experience data loss. However, it is best to scale up if you are seeing these errors to avoid further issues.
Additionally, ensure that your tenant does not have any policies in place that would prevent Terraform from creating a client SP secret, or you will see an error in the deployment.
In Azure services, availability refers to the percentage of time that a service or application is available and functioning as expected.
The following availability metrics are supported by Azure Monitor. This version of the solution queries only these metrics.
Resource Type | Metric Name(Azure Monitor) | Availability metric calculation |
---|---|---|
AKS Server Node | kube_node_status_condition | (Ready / (Ready + Not Ready)) x 100 |
Load Balancer | VipAvailability | - |
Firewall | FirewallHealth | - |
Storage | Availability | - |
Cosmos DB | ServiceAvailability | - |
Key Vault | Availability | - |
Event Hubs | IncomingRequests, ServerErrors | ((IncomingRequests - ServerErrors) / IncomingRequests) x 100 |
Container Registry | Successful/Total Push, Successful/Total Pull | ((Successful Push + Pull)/(Total Push + Pull)) x 100 |
Log Analytics | AvailabilityRate_Query | - |
In this section, you will see how the Grafana dashboard displays availability metrics over a given time frame for each queried resource type.
You can also drill down to a more detailed view of the monitored resources and click on the ID field to navigate to the resource's overview page in Azure Portal.
The following section describes the Prerequisites and Installation steps to deploy the solution.
Note: Azure Role Permissions: User should have access to create ManagedIdentity/Service Principal on the subscription
The script can be executed in Linux - Ubuntu 20.04 (VM, WSL).
Note: currently Azure Cloud Shell is not supported since it uses az-cli > 2.46.0
## Clone git repo into the folder
repolink=""
codePath=$"./observability"
git clone $repolink $codePath
## Please setup the following required parameters for the script to run:
## prefix - prefix string to identify the resources created with this deployment. eg: test
## subscriptionId - subscriptionId where the solution will be deployed to
## location - location where the azure resources will be created. eg: eastus
# change directory to where the repo is cloned
cd $codePath
# set current working directory
currentDir=$(pwd)
# install pre-requisites
bash $currentDir/Utils/scripts/pre-requisites.sh
#downgrade az-cli to use version < 2.46
apt-cache policy azure-cli
sudo apt-get install azure-cli=<version>-1~<Codename>
eg: sudo apt-get install azure-cli=2.46.0-1~focal (Codename - focal/bionic/bullseye etc)
# change directory to where Terraform main.tf is located
cd $currentDir/Utils/scripts/Terraform
Note: if you are deploying feature improvements on top of an existing deployment, please copy over the tfstate files from the folders resources,grafana-datasource and grafana-dashboards from your existing deployment to the cloned repository
#log in to the tenant where the subscription to host the resources is present
az login
#list the subscriptions under the tenant
az account show
#set the subscription where the resources are to be deployed
az account set --subscription <subscriptionId>
## 1. Create resources using Terraform
cd resources
#initialize terraform providers
terraform init
# run a plan on the root file
terraform plan -var="prefix=<prefix>" -var="subscriptionId=<subscriptionId>" -var="location=<preferredLocation>" -parallelism=<count>
eg: terraform plan -var="prefix=test" -var="subscriptionId=00000000-0000-0000-0000-000000000000" -var="location=eastus" -parallelism=1
# run apply on the root file
terraform apply -var="prefix=<prefix>" -var="subscriptionId=<subscriptionId>" -var="location=<preferredLocation>" -parallelism=<count>
eg: terraform apply -var="prefix=test" -var="subscriptionId=00000000-0000-0000-0000-000000000000" -var="location=eastus" -parallelism=1
note: make sure to confirm resource creation with a "yes" when the prompt appears on running this command
# add "grafana admin" role to the user as described here - https://learn.microsoft.com/en-us/azure/managed-grafana/how-to-share-grafana-workspace?tabs=azure-portal
# run post installation script to set up some additional variables
sh post_install.sh
# create api key and export all variables
export TF_VAR_database_name=$(terraform output -raw database_name)
export TF_VAR_cluster_url=$(terraform output -raw cluster_url)
export TF_VAR_sp_object_id=$(terraform output -raw sp_object_id)
export TF_VAR_prefix=$(terraform output -raw prefix)
export TF_VAR_url=$(az grafana show -g $TF_VAR_prefix-RG -n $TF_VAR_prefix-grafana -o json | jq -r .properties.endpoint)
export TF_VAR_token=$(az grafana api-key create --key `date +%s` --name $TF_VAR_prefix-grafana -g $TF_VAR_prefix-RG -r editor --time-to-live 60m -o json | jq -r .key)
## 2. Update grafana instance to create datasource, folders and dashboards using Terraform
cd ../grafana-datasource
#initialize terraform providers
terraform init -upgrade
# run a plan on the root file
terraform plan
# run apply on the root file
terraform apply
## 3. Update grafana instance to create folders and dashboards using Terraform
cd ../grafana-dashboards
#initialize terraform providers
terraform init -upgrade
# run a plan on the root file
terraform plan
# run apply on the root file
terraform apply
The solution relies on the following data to be present in the "Resource Providers" and "Subscriptions" tables before it can be used to visualize the data. Follow the steps below to complete the post installation steps.
Note: While saving to local ensure that you save the file with csv extension, the default is set to .txt
Finally, add "Monitoring Reader" role for the Managed Identity and Service Principal created by script to the subscriptions that you want to monitor within the tenant where you have deployed the solution.
Currently, the following command needs to be executed manually on the ADX cluster to enable native ingestion from storage with MSI.
.alter-merge cluster policy managed_identity "[{ 'ObjectId' : '%%%%', 'AllowedUsages' : 'NativeIngestion' }]"
The ObjectId of the ADX system-assigned identity should be inputted here. This ObjectId can be found by navigating to the ADX Cluster > Security + Networking > Identity in the Azure portal.
In order to support monitoring of additional tenants, you will have add the appropriate service principal credentials to Key Vault. Follow the steps below to create and upload the client secrets.
Creating a Service Principal: follow the steps described to create a multitenant app registration and client secret in the tenant you would like to monitor.
Add the "Monitoring Reader" role for this Service Principal to any subscriptions which you would like to monitor within the tenant.
Upload the Service Principal credentials as a secret to Key Vault in the following format:
Secret Name: tenant-[TenantId]
Secret Value: {"ClientId":"[ServicePrincipalClientId]","ClientSecret":"[ServicePrincipalSecretValue]"}
In order to update the frequency of pulling availability metrics, you can navigate to the app settings of the TimerStartPipelineFunction. Here, you can modify the MyTimeTrigger variable in environment variables as seen below to reduce to 1 minute, or your desired interval.
To add other users to view/edit the Grafana dashboard, follow adding role assignment to managed grafana
sas token - expires in a year need to update it
az grafana create not compatible with az cli versions > 2.46 ongoing issue - https://github.com/Azure/azure-cli-extensions/issues/6221, advice to use lower versions of cli <=2.46 until the issue is resolved.
please ensure you are storing the tfstate files in the following locations so that they can be used to deploy further improvements in the future
Note: for MSFT Tenant, remove the secret in key vault in your existing deployment before incremental deployment, and save it(save name and secret value). Add it back to key vault manually after incremental deployment is finished.