Laradock Monitoring Tools (using Grafana)
MIT License
A monitoring solution for Docker hosts and containers with Prometheus, Grafana, cAdvisor, NodeExporter and alerting with AlertManager.
Clone this repository on your Docker host, cd into dockprom directory and run compose up:
git clone https://github.com/zeroc0d3/monitor-laradock
cd monitor-laradock
ADMIN_USER=admin ADMIN_PASSWORD=admin docker-compose up -d
Prerequisites:
Containers:
http://<host-ip>:9090
http://<host-ip>:9091
http://<host-ip>:9093
http://<host-ip>:3000
Navigate to http://<host-ip>:3000
and login with user admin password admin. You can change the credentials in the compose file or by supplying the ADMIN_USER
and ADMIN_PASSWORD
environment variables on compose up.
Grafana is preconfigured with dashboards and Prometheus as the default data source:
Docker Host Dashboard
The Docker Host Dashboard shows key metrics for monitoring the resource usage of your server:
For storage and particularly Free Storage graph, you have to specify the fstype in grafana graph request.
You can find it in grafana/dashboards/docker_host.json
, at line 480 :
"expr": "sum(node_filesystem_free{fstype=\"btrfs\"})",
I work on BTRFS, so i need to change aufs
to btrfs
.
You can find right value for your system in Prometheus http://<host-ip>:9090
launching this request :
node_filesystem_free
Docker Containers Dashboard
The Docker Containers Dashboard shows key metrics for monitoring running containers:
Note that this dashboard doesn't show the containers that are part of the monitoring stack.
Monitor Services Dashboard
The Monitor Services Dashboard shows key metrics for monitoring the containers that make up the monitoring stack:
I've set the Prometheus retention period to 200h and the heap size to 1GB, you can change these values in the compose file.
prometheus:
image: prom/prometheus
command:
- '-storage.local.target-heap-size=1073741824'
- '-storage.local.retention=200h'
Make sure you set the heap size to a maximum of 50% of the total physical memory.
I've setup three alerts configuration files:
You can modify the alert rules and reload them by making a HTTP POST call to Prometheus:
curl -X POST http://admin:admin@<host-ip>:9090/-/reload
Monitoring services alerts
Trigger an alert if any of the monitoring targets (node-exporter and cAdvisor) are down for more than 30 seconds:
ALERT monitor_service_down
IF up == 0
FOR 30s
LABELS { severity = "critical" }
ANNOTATIONS {
summary = "Monitor service non-operational",
description = "{{ $labels.instance }} service is down.",
}
Docker Host alerts
Trigger an alert if the Docker host CPU is under high load for more than 30 seconds:
ALERT high_cpu_load
IF node_load1 > 1.5
FOR 30s
LABELS { severity = "warning" }
ANNOTATIONS {
summary = "Server under high load",
description = "Docker host is under high load, the avg load 1m is at {{ $value}}. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.",
}
Modify the load threshold based on your CPU cores.
Trigger an alert if the Docker host memory is almost full:
ALERT high_memory_load
IF (sum(node_memory_MemTotal) - sum(node_memory_MemFree + node_memory_Buffers + node_memory_Cached) ) / sum(node_memory_MemTotal) * 100 > 85
FOR 30s
LABELS { severity = "warning" }
ANNOTATIONS {
summary = "Server memory is almost full",
description = "Docker host memory usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.",
}
Trigger an alert if the Docker host storage is almost full:
ALERT hight_storage_load
IF (node_filesystem_size{fstype="aufs"} - node_filesystem_free{fstype="aufs"}) / node_filesystem_size{fstype="aufs"} * 100 > 85
FOR 30s
LABELS { severity = "warning" }
ANNOTATIONS {
summary = "Server storage is almost full",
description = "Docker host storage usage is {{ humanize $value}}%. Reported by instance {{ $labels.instance }} of job {{ $labels.job }}.",
}
Docker Containers alerts
Trigger an alert if a container is down for more than 30 seconds:
ALERT jenkins_down
IF absent(container_memory_usage_bytes{name="jenkins"})
FOR 30s
LABELS { severity = "critical" }
ANNOTATIONS {
summary= "Jenkins down",
description= "Jenkins container is down for more than 30 seconds."
}
Trigger an alert if a container is using more than 10% of total CPU cores for more than 30 seconds:
ALERT jenkins_high_cpu
IF sum(rate(container_cpu_usage_seconds_total{name="jenkins"}[1m])) / count(node_cpu{mode="system"}) * 100 > 10
FOR 30s
LABELS { severity = "warning" }
ANNOTATIONS {
summary= "Jenkins high CPU usage",
description= "Jenkins CPU usage is {{ humanize $value}}%."
}
Trigger an alert if a container is using more than 1,2GB of RAM for more than 30 seconds:
ALERT jenkins_high_memory
IF sum(container_memory_usage_bytes{name="jenkins"}) > 1200000000
FOR 30s
LABELS { severity = "warning" }
ANNOTATIONS {
summary = "Jenkins high memory usage",
description = "Jenkins memory consumption is at {{ humanize $value}}.",
}
The AlertManager service is responsible for handling alerts sent by Prometheus server. AlertManager can send notifications via email, Pushover, Slack, HipChat or any other system that exposes a webhook interface. A complete list of integrations can be found here.
You can view and silence notifications by accessing http://<host-ip>:9093
.
The notification receivers can be configured in alertmanager/config.yml file.
To receive alerts via Slack you need to make a custom integration by choose incoming web hooks in your Slack team app page. You can find more details on setting up Slack integration here.
Copy the Slack Webhook URL into the api_url field and specify a Slack channel.
route:
receiver: 'slack'
receivers:
- name: 'slack'
slack_configs:
- send_resolved: true
text: "{{ .CommonAnnotations.description }}"
username: 'Prometheus'
channel: '#<channel>'
api_url: 'https://hooks.slack.com/services/<webhook-id>'
The pushgateway is used to collect data from batch jobs or from services.
To push data, simply execute:
echo "some_metric 3.14" | curl --data-binary @- http://user:password@localhost:9091/metrics/job/some_job
Please replace the user:password
part with your user and password set in the initial configuration (default: admin:admin
).