Moneo

Distributed AI/HPC Monitoring Framework

MIT License

Moneo - Moneo v0.3.5 Latest Release

Published by rafsalas19 6 months ago

What's Changed

  • Bug Fixes
  • Update templates
  • Unified configuration files into a single file

Moneo - Moneo v0.3.4

Published by rafsalas19 9 months ago

What's Changed

  • Custom exporter via Moneo CLI
  • GPU sample rate is now adjustable
  • Update to Moneo Docker: Moneo exporters are now launched within Docker
  • Refresh Linux server deployment and docs
  • Added front-end network naming arguments for the node exporter

Full Changelog: https://github.com/Azure/Moneo/compare/v0.3.3...v0.3.4

Moneo - Moneo v0.3.3

Published by rafsalas19 about 1 year ago

What's Changed

  • Doc updates
  • ARM templates for Managed Prometheus and Grafana infrastructure
  • Slurm example updates
  • Bug fixes

Full Changelog: https://github.com/Azure/Moneo/compare/v0.3.2...v0.3.3

Moneo - Moneo v0.3.2

Published by rafsalas19 about 1 year ago

Minor bug fixes

Moneo - Moneo v0.3.1

Published by rafsalas19 over 1 year ago

What's Changed

Add physical hardware hostname to managed Prometheus deployment method.

Moneo - v0.3.0

Published by rafsalas19 over 1 year ago

What's Changed

  • Managed Prometheus/Grafana deployment method set to preferred method for Services
  • IB Link flap metric fixes
  • Bug fixes

Moneo - Moneo v0.2.6

Published by rafsalas19 over 1 year ago

What's Changed

  • Managed Prometheus Agent Integration
  • Fixed a bug related to accelerated networking
  • Alma bug fix

Moneo - Moneo v0.2.5

Published by rafsalas19 over 1 year ago

  • Moneo no longer uses Ansible; it has been replaced with pssh
  • Moneo can now be launched as a Linux service
  • Moneo can now publish to Geneva or Azure Monitor/Log Analytics
  • Small bug fixes

Moneo - v0.2.4.2

Published by rafsalas19 over 1 year ago

What's Changed

  • Geneva cert authentication feature
  • Doc updates
  • Bug fixes

Moneo - Moneo v0.2.4.1

Published by rafsalas19 over 1 year ago

What's Changed

  • Moneo no longer uses Ansible; it has been replaced with pssh
  • Moneo can now be launched as a Linux service
  • Moneo can now publish to Geneva or Azure Monitor/Log Analytics
  • Small bug fixes

Moneo - Moneo v0.2.4 experimental release

Published by rafsalas19 over 1 year ago

What's Changed

  • Moneo no longer uses Ansible; it has been replaced with pssh
  • Moneo can now be launched as a Linux service
  • Moneo can now publish to Geneva or Azure Monitor/Log Analytics

Moneo - Moneo v0.2.3

Published by rafsalas19 almost 2 years ago

What's Changed

  • CLI and exporter output is now logged to files:
    • CLI log location: Moneo/moneoCLI.log
    • Exporter log location (on the worker node): /tmp/moneo-worker/moneoExporter.log
  • Added XID and Link Flap graphs to capture when these issues occur

Moneo - Moneo v0.2.2

Published by rafsalas19 almost 2 years ago

What's Changed

Moneo - Moneo v0.2.1

Published by rafsalas19 about 2 years ago

What's Changed

  • Features
    • Node exporter for metrics with lower sample rate
    • Ansible deployment can be parallelized via Moneo CLI
  • Addition of new metrics
    • CPU, Memory, and front-end network telemetry added
    • Additional metrics added to cluster view dashboard
  • Fixes and other changes
    • Pass in IB counter file as argument for IB net exporter.
    • Remove Python 2 dependencies
    • Fixes to the job update for Nvidia exporter
  • Updating documentation
    • Node/Base exporter examples of custom telemetry addition
    • Examples of Slurm prolog/epilog integration

Moneo - Moneo v0.2.0

Published by rafsalas19 about 2 years ago

Moneo is a distributed GPU system monitor for AI workflows.

Moneo orchestrates metric collection (DCGMI + Prometheus DB) and visualization (Grafana) across multi-GPU/node systems. This provides useful insight into workflow-level and system-level characterization.

What's Changed

  • Features
    • Moneo CLI added to deploy, shutdown, and update Moneo workflow
    • Cluster-wide dashboard added to provide an overall view of the cluster
    • Job filtering at node level
    • Update AMD exporter (Experimental)
    • Azure Application Insights Integration (Experimental)
  • Addition of new metrics
    • IB port metrics to observe link and port status
    • GPU throttling metrics
  • Fixes and other changes
    • Profiling Metrics disabled by default to avoid additional overhead
    • Skip DCGM install if already installed
  • Updating documentation
    • Quick start guide
    • Job filtering document

Moneo - v0.1.2

Published by rafsalas19 over 2 years ago

  • Fix for net exporter failures caused by Python version issues
  • Added additional information (requirements and access) to the README

Moneo - v0.1.1

Published by rafsalas19 over 2 years ago

  • Fixed hostname resolution issues, so Moneo now works with Azure CycleCloud.
  • Fixes to account for a shared home directory.
  • Fixed a small net exporter issue.

Moneo - v0.1.0

Published by rafsalas19 over 2 years ago

Moneo v0.1.0 Release Notes

Moneo GPU Monitoring Framework

  • First release of Moneo; includes an Ansible framework to coordinate Docker containers and metric collection workers.
  • Grafana front-end implementation.
  • Includes Prometheus DB to store metrics.
  • Reference README doc for usage instructions.