Distributed AI/HPC Monitoring Framework
MIT License
Moneo is a distributed GPU system monitor for AI workflows. It orchestrates metric collection (DCGMI + Prometheus DB) and visualization (Grafana) across multi-GPU/node systems, providing useful insight into workload- and system-level characteristics.
Moneo offers three deployment methods:
Moneo Headless Method:
There are five categories of metrics that Moneo monitors:
GPU Counters
GPU Profiling Counters
InfiniBand Network Counters
CPU Counters
Memory
Menu: List of available dashboards.
Note: When viewing GPU dashboards, note whether you are using NVIDIA or AMD GPU nodes and select the corresponding dashboard.
Cluster View: min, max, and average across devices for GPU/IB metrics, per VM.
GPU Device Counters: Detailed view of node level GPU counters.
GPU Profiling Counters: Node-level profiling metrics require additional overhead, which may affect workload performance. Tensor, FP16, FP32, and FP64 activity are disabled by default but can be enabled via a CLI option.
InfiniBand Network Counters: Detailed view of node level IB network metrics.
Node View: Detailed view of node level CPU, Memory, and Network metrics.
Python >= 3.7 installed
OS Support:
Note: Not applicable if using Azure Managed Grafana/Prometheus
Get the code:
Clone Moneo from Github.
# get the code
git clone https://github.com/Azure/Moneo.git
cd Moneo
# install dependency
sudo apt-get install pssh
Note: If you are using an Azure Ubuntu HPC-AI VM image, you can find Moneo at this path: /opt/azurehpc/tools/Moneo
The moneo_config.json file can be used to specify certain deployment settings prior to deploying Moneo.
There are 4 groups of configurations:
The preferred way to deploy Moneo is the headless method using Azure Managed Grafana and Prometheus resources.
Complete the steps listed here: Headless Deployment Guide
This method requires deploying a head node to host the local Prometheus database and Grafana server.
Complete the steps listed here: Local Grafana Deployment Guide
The Moneo CLI provides an alternative way to deploy and update Moneo manager and worker nodes. Linux services are preferred, but the CLI offers direct control over Moneo.
python3 moneo.py [-d/--deploy] [-c hostfile] {manager,workers,full}
python3 moneo.py [-s/--shutdown] [-c hostfile] {manager,workers,full}
python3 moneo.py [-j JOB_ID ] [-c hostfile]
python3 moneo.py -d -c ./hostfile full
Note: For more options check the Moneo help menu
python3 moneo.py --help
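The CLI invocations above can be assembled programmatically, for example from a job scheduler prologue. A minimal sketch, using only the flags documented above (the hostfile path is a placeholder):

```python
# Hedged helper that builds the documented moneo.py command lines
# (-d/--deploy, -s/--shutdown, -c hostfile, -j JOB_ID) as argument lists,
# e.g. for use with subprocess.run.

def moneo_cmd(action, hostfile="./hostfile", scope="full", job_id=None):
    """Build a moneo.py command line as a list of arguments."""
    cmd = ["python3", "moneo.py"]
    if action == "deploy":
        cmd += ["-d", "-c", hostfile, scope]
    elif action == "shutdown":
        cmd += ["-s", "-c", hostfile, scope]
    elif action == "job":
        cmd += ["-j", str(job_id), "-c", hostfile]
    else:
        raise ValueError(f"unknown action: {action}")
    return cmd

print(" ".join(moneo_cmd("deploy")))  # python3 moneo.py -d -c ./hostfile full
```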
NVIDIA exporter may conflict with DCGMI
There are two modes for DCGM: embedded mode and standalone mode.
If DCGM is started in embedded mode (e.g., nv-hostengine -n, using the no-daemon option -n), the exporter will use the embedded DCGM agent, while DCGMI may return an error.
It is recommended to start DCGM in standalone mode as a daemon, so that multiple clients, such as the exporter and DCGMI, can interact with DCGM at the same time. According to NVIDIA, this mode of operation is generally preferred, as it provides the most flexibility and the lowest maintenance cost to users.
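The distinction above can be checked from process listings. Below is an illustrative sketch (not part of Moneo) that inspects ps output for nv-hostengine and reports whether it was started with the no-daemon flag -n, i.e. in embedded mode:

```python
# Hedged sketch: classify the DCGM host engine mode from `ps` output lines.
# A line containing "nv-hostengine -n" indicates embedded mode, which can
# make DCGMI fail while the exporter holds the embedded agent.

def hostengine_mode(ps_lines):
    """Return 'standalone', 'embedded', or 'not running' from ps output lines."""
    for line in ps_lines:
        if "nv-hostengine" in line:
            return "embedded" if " -n" in line else "standalone"
    return "not running"

sample = ["root   812     1  0 ?        nv-hostengine"]
print(hostengine_mode(sample))  # standalone
```

If this reports "embedded", stop the host engine and restart it as a daemon (standalone mode) so the exporter and DCGMI can share it.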
Moneo will attempt to install a tested version of DCGM if it is not present on the worker nodes; this step is skipped if DCGM is already installed. In some instances the preinstalled DCGM may be too old, which can cause the NVIDIA exporter to fail. In that case it is recommended to upgrade DCGM to at least version 2.4.4.
To view which exporters are running on a worker, run:
ps -eaf | grep python3
For Managed Grafana (headless) deployment:
Verify that the config file (Moneo/moneo_config.json) is configured correctly on each worker node, then confirm that Prometheus has finished replaying its write-ahead log:
sudo docker logs prometheus | grep 'Done replaying WAL'
ts=2023-08-07T07:25:49.636Z caller=dedupe.go:112 component=remote level=info remote_name=6ac237 url="<ingestion_endpoint>" msg="Done replaying WAL" duration=8.339998173s
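If you want to automate this check, a small sketch like the following (an illustration, not part of Moneo) can scan the docker logs output for the 'Done replaying WAL' line and extract the replay duration:

```python
import re

# Illustrative check for the Prometheus "Done replaying WAL" log line,
# e.g. from `sudo docker logs prometheus`.

def wal_replayed(log_text):
    """Return the replay duration string if the WAL finished replaying, else None."""
    m = re.search(r'msg="Done replaying WAL" duration=(\S+)', log_text)
    return m.group(1) if m else None

line = ('ts=2023-08-07T07:25:49.636Z caller=dedupe.go:112 component=remote '
        'level=info remote_name=6ac237 url="<ingestion_endpoint>" '
        'msg="Done replaying WAL" duration=8.339998173s')
print(wal_replayed(line))  # 8.339998173s
```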
For deployments with a Headnode:
sudo docker container ls
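On a head node, the listing above should show the local Prometheus and Grafana containers. A hedged sketch of that check, parsing the docker container ls output (the container names here are assumptions; match them to your deployment):

```python
# Illustrative helper: report which expected head-node containers are
# missing from `sudo docker container ls` output. The names "prometheus"
# and "grafana" are assumptions for this example.

def missing_containers(docker_ls_output, expected=("prometheus", "grafana")):
    """Return the expected container names not found in the listing."""
    return [name for name in expected if name not in docker_ls_output]

sample = ("CONTAINER ID  IMAGE            ...  NAMES\n"
          "abc123        prom/prometheus  ...  prometheus\n")
print(missing_containers(sample))  # ['grafana']
```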
All deployments:
Verifying exporters on worker node:
ps -eaf | grep python3
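The ps output above can be filtered programmatically as well. The sketch below (an illustration; the exporter script names are made up for the example) extracts the command portion of each python3 process line so the Moneo exporters are easy to spot:

```python
# Illustrative parser for `ps -eaf | grep python3` output: list the
# command of each python3 process, skipping the grep process itself.
# The exporter script names in the sample are hypothetical.

def python3_processes(ps_output):
    """Return the command portion of each python3 process line."""
    procs = []
    for line in ps_output.splitlines():
        if "python3" in line and "grep" not in line:
            # ps -eaf columns: UID PID PPID C STIME TTY TIME CMD...
            procs.append(" ".join(line.split()[7:]))
    return procs

sample = (
    "root  2301  1  0 10:02 ?     00:00:01 python3 gpu_exporter.py\n"
    "root  2302  1  0 10:02 ?     00:00:00 python3 net_exporter.py\n"
    "user  4410  9  0 10:05 pts/0 00:00:00 grep python3"
)
print(python3_processes(sample))
```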
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.