netdata

The open-source observability platform everyone needs!

GPL-3.0 License

Stars
68.6K
Committers
630

netdata - v1.35.0

Published by Ferroin over 2 years ago

❗ We're keeping our codebase healthy by removing features that are end of life. Read the deprecation notice to check if you are affected.

Netdata open-source Agent statistics

  • 7.6M+ troubleshooters monitor with Netdata
  • 1.3M+ unique nodes currently live
  • 3.3k+ new nodes per day
  • Over 556M Docker pulls all-time total

Release highlights

Anomaly Advisor & on-device Machine Learning

We are excited to launch one of our flagship machine learning (ML) assisted troubleshooting features in Netdata: the Anomaly Advisor.

Netdata now comes with on-device ML! Unsupervised ML models are trained for every metric, at the edge (on your devices), enabling real time anomaly detection across your infrastructure.

This feature is part of a broader philosophy we have at Netdata when it comes to how we can leverage ML-based solutions to help augment and assist traditional troubleshooting workflows, without having to centralize all your data.

The new Anomalies tab quickly lets you find periods of time with elevated anomaly rates across all of your nodes. Once you highlight a period of interest, Netdata will generate a ranked list of the most anomalous metrics across all nodes in the highlighted timeframe. The goal is to quickly let you find periods of abnormal activity in your infrastructure and bring to your attention the metrics that were most anomalous during that time.

In our latest release, we improved the usability of Anomaly Advisor and also ensured that the anomalous metrics are always relevant to the time period you are investigating.

A great deal of care has gone into ensuring that ML running on your device is as lightweight as possible in terms of resource consumption. For instance, metrics that do not have sufficient data for training and metrics that are consistently constant during training periods are considered to be "normal" until their behavior changes significantly enough to require re-training of the ML models.
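
As a rough illustration of the ranking described above, the sketch below computes an anomaly rate (the share of samples flagged anomalous) for each metric inside a highlighted window and sorts the metrics by it. The data layout (a dict mapping a metric name to a list of 0/1 anomaly bits), the metric names, and the function names are assumptions made for illustration; this is not Netdata's actual implementation.

```python
from typing import Dict, List, Tuple


def anomaly_rate(bits: List[int]) -> float:
    """Return the percentage of samples flagged anomalous (bit == 1)."""
    if not bits:
        return 0.0
    return 100.0 * sum(bits) / len(bits)


def rank_metrics(
    anomaly_bits: Dict[str, List[int]], start: int, end: int
) -> List[Tuple[str, float]]:
    """Rank metrics by anomaly rate within the half-open window [start, end)."""
    rates = {
        name: anomaly_rate(bits[start:end]) for name, bits in anomaly_bits.items()
    }
    # Most anomalous first; ties are broken by metric name for stable output.
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical per-second anomaly bits for three metrics.
    sample = {
        "system.cpu.user": [0, 0, 1, 1, 1, 0],
        "disk.io.reads": [0, 0, 0, 0, 1, 0],
        "net.eth0.received": [0, 0, 0, 0, 0, 0],
    }
    for name, rate in rank_metrics(sample, start=2, end=6):
        print(f"{name}: {rate:.0f}%")
```

Restricting the calculation to the highlighted window is what keeps the ranked list relevant to the period under investigation.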

To use this feature, enable ML on your agent and then navigate to the "Anomalies" tab in Netdata Cloud. Update netdata.conf with the following information to enable ML on your agent:

[ml]
    enabled = yes

Read more about Anomaly Advisor at our blog.

Metrics Correlation on Agent

Metric Correlations allow you to quickly find metrics and charts related to a particular window of interest that you want to explore further. Metric correlations compare two adjacent windows to find how they relate to each other, and then score all metrics based on this rating, providing a list of metrics that may have influence or have been influenced by the highlighted one.

Metric Correlations was already available in Netdata Cloud, but now we are releasing a version implemented in the Netdata Agent, which drastically reduces the time required to run it. This means metric correlations can now run almost instantly (more than 10x faster than before)!

To enable the new metric correlation at the Netdata Agent, set the following in your netdata.conf file:

[global]
    enable metric correlations = yes
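
This release lists a ks_2samp test among the Metric Correlations changes (#12582). The sketch below shows, in plain Python, how a two-sample Kolmogorov-Smirnov statistic can score how much a metric changed between a baseline window and the adjacent highlighted window. The function names, the metric names in the usage note, and the way the windows are split are assumptions for illustration, not the agent's code.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple


def ks_statistic(baseline: Sequence[float], highlight: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two windows (0 = identical, 1 = fully separated)."""
    if not baseline or not highlight:
        raise ValueError("both windows need at least one sample")
    a, b = sorted(baseline), sorted(highlight)
    max_gap = 0.0
    # The largest gap between two step functions occurs at a sample value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def correlate(
    series: Dict[str, Sequence[float]], split: int
) -> List[Tuple[str, float]]:
    """Score every metric by comparing values[:split] (baseline) against
    values[split:] (highlighted window); most changed metrics come first."""
    scores = {
        name: ks_statistic(values[:split], values[split:])
        for name, values in series.items()
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For example, `correlate({"steady": [5, 5, 5, 5, 5, 5], "jump": [1, 1, 1, 9, 9, 9]}, split=3)` places `jump` first, because its two windows share no values.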

Kubernetes monitoring

On very busy Kubernetes clusters, where hundreds of containers are spawned and destroyed all the time, Netdata was consuming a lot of resources, was slow to detect changes, and under certain conditions missed some containers.

Now, Netdata:

  1. Detects "pause" containers and skips them, greatly improving performance during discovery
  2. Detects containers that are initializing and postpones discovery for them until they are properly initialized
  3. Uses fewer resources during container discovery

Netdata is also capable of detecting the network interfaces that have been allocated to containers, by spawning a process that switches network namespace and identifies virtual interfaces that belong to each container. This process has been improved drastically and now requires 1/3 of the CPU resources it needed before.

Additionally, Netdata cgroups.plugin now collects CPU shares for Kubernetes containers, allowing the visualization of the Kubernetes CPU Requests (Kubernetes writes in cgroup CPU Shares the CPU Requests that have been configured for the containers).
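
To make the relationship between CPU shares and CPU Requests concrete, here is a small sketch of the conversion, assuming the cgroup v1 convention of 1024 shares per full CPU and a minimum of 2 shares. The constants and function names are illustrative assumptions; this is not the code used by cgroups.plugin, and cgroup v2 hosts express the same setting differently.

```python
SHARES_PER_CPU = 1024  # cgroup v1 weight that corresponds to one full CPU
MILLI_PER_CPU = 1000   # Kubernetes expresses CPU requests in millicores
MIN_SHARES = 2         # smallest cpu.shares value used for a container


def request_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores to a cgroup v1 cpu.shares value."""
    if milli_cpu <= 0:
        return MIN_SHARES
    return max(MIN_SHARES, milli_cpu * SHARES_PER_CPU // MILLI_PER_CPU)


def shares_to_request(shares: int) -> int:
    """Recover the approximate CPU request (millicores) from cpu.shares."""
    return round(shares * MILLI_PER_CPU / SHARES_PER_CPU)
```

Under these assumptions a container requesting 250m CPU ends up with 256 shares, and reading 256 shares back gives 250 millicores. The reverse direction is approximate because the forward conversion truncates.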

A new option has been added in netdata.conf [plugin:cgroup] section, to allow filtering containers by (resolved) name. It matches the name of the cgroup (as you see it on the dashboard).

We have also released a blog post and a video about CPU Throttling in Kubernetes. You will be amazed by our findings. Read the blog and watch the video about Kubernetes CPU throttling.

Visualization improvements

Netdata Cloud dashboards are now a lot faster in aggregating data from multiple agents, as the protocol between agents and the Cloud is approaching its final shape.

New look for Netdata charts

Netdata Cloud has a new look and feel for charts, which resembles the look and feel of coding IDEs.

New home for war rooms

The new home tab for war rooms allows you to quickly inspect the most important metrics for every war room, like number of nodes, metrics, retention, replication, alerts, users, custom dashboards, etc.

Time units

Time units in charts now auto-scale from microseconds to days, based on the value to be shown.
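
The idea can be sketched as a small helper that picks the largest unit in which a duration is at least 1. The unit table, the choice of microseconds as the input unit, and the function name are assumptions for illustration; the dashboard's real logic may differ.

```python
from typing import List, Tuple

# (unit name, size of the unit in microseconds), smallest to largest.
_UNITS: List[Tuple[str, int]] = [
    ("microseconds", 1),
    ("milliseconds", 1_000),
    ("seconds", 1_000_000),
    ("minutes", 60_000_000),
    ("hours", 3_600_000_000),
    ("days", 86_400_000_000),
]


def scale_time(microseconds: int) -> Tuple[float, str]:
    """Return (value, unit) using the largest unit where |value| >= 1."""
    magnitude = abs(microseconds)
    chosen_name, chosen_size = _UNITS[0]
    for name, size in _UNITS:
        if magnitude >= size:
            chosen_name, chosen_size = name, size
    return microseconds / chosen_size, chosen_name
```

With this helper, `scale_time(90_000_000)` returns `(1.5, "minutes")`, while values below one millisecond stay in microseconds.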

Cloud queries timeout

Netdata Cloud now sets a timeout on every query it sends to the agents, and the agents now respect this timeout. Previously, the Cloud would time out on a slow query while the agents remained busy executing it, which had a cascading effect on agent load.
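
A minimal sketch of what respecting a caller-supplied timeout can look like: the query loop checks a deadline while it iterates and stops as soon as the budget is spent, instead of finishing work nobody is waiting for. The parameter name `timeout_ms`, the exception type, and the injectable clock are hypothetical and exist only for this illustration.

```python
import time
from typing import Callable, Iterable, List


class QueryTimeout(Exception):
    """Raised when a query exceeds the time budget set by the caller."""


def run_query(
    points: Iterable[float],
    timeout_ms: int,
    clock: Callable[[], float] = time.monotonic,
) -> List[float]:
    """Collect points until done, aborting once the deadline has passed."""
    deadline = clock() + timeout_ms / 1000.0
    result: List[float] = []
    for value in points:
        # Checking inside the loop frees the worker soon after the caller
        # has given up, rather than after the whole query completes.
        if clock() >= deadline:
            raise QueryTimeout(f"query exceeded {timeout_ms} ms")
        result.append(value)
    return result
```

Passing the clock as a parameter keeps the sketch easy to exercise with a fake time source.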

Custom dashboards

Custom dashboards on Netdata Cloud can now be renamed.

Alerts management

All configured alerts on the Cloud

We have added a new Alert Configs sub tab which lists all the alerts configured on all the nodes belonging to the war room. You can now list the alerts configured at the war room, node, and alert instance levels.

Stale alerts

There have been a number of corner cases under which alerts could remain raised on Netdata cloud. We identified all such cases, and now Netdata Cloud is always in sync with Netdata agents about their alerts.

Nodes management

Cloud provider metadata

Netdata now identifies the Cloud provider node type it runs on. It works for GCP and AWS, and exposes this information at the Nodes tab, the single node dashboard, and the node inspector.

Virtualization detection fixes

We improved the virtualization detection in cases where systemd is not available. Now Netdata can properly detect virtualization even in these cases.

Global nodes filter on all tabs of a space

The new Netdata Cloud now supports a global filter on nodes of war rooms. The new filter is applied on every tab for each room, allowing users to quickly switch between tabs while retaining the nodes filtered.

Obsoletion of nodes

Netdata admin users now have the ability to remove obsolete nodes from a space. Many users have been eagerly waiting for this feature, and we thank you for your patience. We hope you will be happy to use the feature and have cleaner spaces and war rooms. A few notes to be considered:

  • Only admin users have the ability to obsolete nodes
  • Only offline nodes can be marked obsolete (Live nodes and stale nodes cannot be obsoleted)
  • Node obsoletion works across the entire space, so the obsoleted node will be removed from all rooms belonging to the space
  • If the obsoleted nodes eventually become live or online once more, they will be automatically re-added to the space

StatsD improvements

Every Netdata Agent is a StatsD server, listening on localhost port 8125, both TCP and UDP. You can use the Netdata StatsD server to quickly visualize metrics from scripts, cron jobs, and local applications.

In this release, the Netdata StatsD server has been improved to use Judy arrays for indexing the collected metrics, drastically improving its performance.

At the same time, we extended the StatsD protocol to support dictionaries. Dictionaries are similar to sets, but instead of reporting only the number of unique entries in the set, dictionaries create a counter for each of the values and report the number of occurrences of each unique event. So, to quickly get a breakdown of events, you can push them to StatsD like myapp.metric:EVENT|d. StatsD will create a chart for myapp.metric, and for each unique EVENT it will create a dimension with the number of times this event was encountered.

We also added the ability to change the units of the chart and the family of the chart, using StatsD tags, like this: myapp.metric:EVENT|d|#units=events/s.

Finally, StatsD now automatically creates a dashboard section for every StatsD application name. Following StatsD best practices, these application names are considered to be the first keyword of collected metrics. For example, by pushing the metric myapp.metric:1|c, StatsD will create the dashboard section "StatsD myapp".
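
The wire format described above can be exercised from a few lines of Python. This is a sketch under assumptions: the helper names are made up for illustration, and sending several newline-separated samples in one UDP datagram is assumed to be accepted by the listener. The metric syntax (`|d`, `|#units=`), the "StatsD myapp" section name, and localhost port 8125 come from the text above.

```python
import socket
from typing import Iterable, Optional


def dictionary_event(metric: str, event: str, units: Optional[str] = None) -> str:
    """Format one StatsD dictionary sample, e.g. 'myapp.metric:EVENT|d'."""
    line = f"{metric}:{event}|d"
    if units:
        # Optional tag that sets the units of the resulting chart.
        line += f"|#units={units}"
    return line


def app_section(metric: str) -> str:
    """Dashboard section derived from the first keyword of the metric name."""
    return "StatsD " + metric.split(".", 1)[0]


def send(lines: Iterable[str], host: str = "localhost", port: int = 8125) -> None:
    """Send newline-separated samples to the agent's StatsD listener over UDP."""
    payload = "\n".join(lines).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

A script could then call `send([dictionary_event("myapp.metric", "LOGIN", units="events/s")])` each time a login happens. UDP is fire-and-forget, so the call does not report whether an agent is actually listening.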

Read more at the Netdata StatsD documentation. A real-life example of using Netdata StatsD from a shell script that pushes metrics in real time to a local Netdata Agent is available at this stress-with-curl.sh gist.

3x faster agent queries

Netdata dashboards refresh all visible charts in parallel, utilizing all the resources the web browsers provide to quickly present the required charts. Since Netdata only stores metric data at the agents, all these queries are executed in parallel at the agents.

This parallelism of queries is even more intense when metrics replication/streaming is configured. In these cases, parent Netdata agents centralize metric data from many agents, and, since Netdata Cloud prefers the more distant parents for queries, they receive quite a few queries in parallel for all their children.

We also reworked many parts of the query engine of Netdata agents to achieve top performance in parallel queries. Now, Netdata agents are able to perform queries at a rate of more than 30 million points per second, per core on modern hardware. On a parent Netdata agent with a 24-core CPU we observed a sustained rate of 1.3 billion points per second! This is 3 times faster compared to the previous release.

To achieve these performance improvements, we worked in these areas:

Query memory management

When querying metric data, a lot of memory allocations need to happen. Netdata agents automatically adapt their memory requirements for data collection, avoiding memory operations while iterating to collect data. Unfortunately, this is not feasible on the query engine side.

To make the agent more efficient for queries, the number of system calls allocating memory had to be drastically decreased. So, we developed a One Way Allocator (OWA), a system that works like a scratchpad for memory allocations. When the query starts, we now predict the amount of memory needed to execute the query. The query engine still does all the individual allocations, but all these are now made against the scratchpad, not against the system. OWA is smart enough to increase the size of the scratchpad if needed during querying. And it frees all memory at once without the need for individual memory releases.

For huge data queries, the benefit is astonishing. For certain heavy data queries, 45000 memory allocations before are down to 20 with this release! This doubled the performance of the query engine.
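
The scratchpad idea can be sketched as a simple bump allocator. The class below is a minimal illustration in Python, not the agent's C implementation; the class name, method names, and the page-growth policy are assumptions.

```python
from typing import List


class OneWayAllocator:
    """Hands out slices of large pre-allocated pages and releases everything
    at once. Individual frees are intentionally not supported."""

    def __init__(self, predicted_size: int) -> None:
        self._page_size = max(1, predicted_size)
        self._pages: List[bytearray] = [bytearray(self._page_size)]
        self._offset = 0             # next free byte in the current page
        self.system_allocations = 1  # pages requested from the system so far

    def alloc(self, size: int) -> memoryview:
        """Return a writable view of `size` bytes carved from the scratchpad."""
        if size < 0:
            raise ValueError("size must be non-negative")
        if not self._pages or self._offset + size > len(self._pages[-1]):
            # The prediction was too small (or everything was freed):
            # grow the scratchpad by adding another page.
            self._pages.append(bytearray(max(self._page_size, size)))
            self._offset = 0
            self.system_allocations += 1
        start = self._offset
        self._offset += size
        return memoryview(self._pages[-1])[start:start + size]

    def free_all(self) -> None:
        """Drop every page in one step; no per-allocation bookkeeping."""
        self._pages.clear()
        self._offset = 0
```

In this sketch, 64 requests of 16 bytes against a 1024-byte prediction cost a single page allocation, which is the effect the allocation counts above describe: many logical allocations, very few trips to the system allocator.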

Number unpacking

To optimize its memory footprint for metric data, Netdata agents store collected metric data into a fixed step database (after interpolation) with a custom floating point number format we developed (we call it storage_number), requiring just 4 bytes per data collection point, including the timestamp. When on disk, mainly due to compression, Netdata's dbengine needs just 0.34 bytes per point (including all metadata), which is probably the best among all monitoring solutions available today, allowing Netdata to massively store and manage metric data at a very high rate.

This means however, that in order to actually use a point in a query, we have to unpack it. This unpacking happens point-by-point even for data cached in memory. 1 billion points in a data query, 1 billion numbers unpacked.

In this release we analyzed the CPU cache efficiency of the number unpacking and we refactored it to make the best use of available CPU caches to finally increase its performance by 30%.
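
One of the changes listed in this release is unpacking storage numbers with a lookup table (#11048). The sketch below shows the general technique on a deliberately simplified, hypothetical 32-bit layout (1 sign bit, 3 exponent bits, 24 mantissa bits). It is not the real storage_number format; every name in it is an assumption for illustration.

```python
MANTISSA_MASK = 0x00FFFFFF  # low 24 bits hold the unsigned mantissa

# One precomputed multiplier for each combination of the top 4 bits:
# bit 3 of the index is the sign, bits 0-2 are a decimal exponent (0..7).
_MULTIPLIERS = [
    (-1.0 if top & 0x8 else 1.0) / (10 ** (top & 0x7)) for top in range(16)
]


def pack(mantissa: int, exponent: int, negative: bool = False) -> int:
    """Pack a value equal to (+/-) mantissa / 10**exponent into 32 bits."""
    if not 0 <= mantissa <= MANTISSA_MASK:
        raise ValueError("mantissa must fit in 24 bits")
    if not 0 <= exponent <= 7:
        raise ValueError("exponent must be between 0 and 7")
    return (int(negative) << 31) | (exponent << 28) | mantissa


def unpack(packed: int) -> float:
    """Unpack with a single table lookup and one multiplication,
    instead of branching on the sign and computing a power of ten."""
    return (packed & MANTISSA_MASK) * _MULTIPLIERS[(packed >> 28) & 0xF]
```

Because the table has only 16 entries, it stays resident in the CPU cache while a query unpacks point after point.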

Streaming

This release includes a better algorithm for picking the available parent to stream metrics to. The previous version always reconnected to the first available parent. Now it rotates through them one by one and then starts over.
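
The rotation can be sketched as a selector that remembers where it stopped and wraps around at the end of the list. The class and method names, and the `is_available` callback, are assumptions for illustration rather than the agent's streaming code.

```python
from typing import Callable, List, Optional


class ParentSelector:
    """Try parents one by one, resuming after the last one attempted and
    wrapping around to the start of the list."""

    def __init__(self, parents: List[str]) -> None:
        if not parents:
            raise ValueError("at least one parent is required")
        self._parents = list(parents)
        self._next = 0  # index of the parent to try on the next attempt

    def connect(self, is_available: Callable[[str], bool]) -> Optional[str]:
        """Return the first available parent found in one full rotation,
        or None if every parent is unavailable."""
        for _ in range(len(self._parents)):
            candidate = self._parents[self._next]
            self._next = (self._next + 1) % len(self._parents)
            if is_available(candidate):
                return candidate
        return None
```

With three reachable parents, successive reconnections pick the first, second, third, and then the first parent again, rather than always returning to the first one.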

An issue was fixed regarding parents with stale alerts from disconnected children. Now, the parent validates all alerts on every child re-connection.

Netdata parents now have a timeout to cleanup dead/abandoned children connections automatically.

We also worked to eliminate most of the bottlenecks when multiple children connect to the same parent. This is still under testing, so it will arrive in the next release.

More optimizations

Workers optimizations

Netdata uses many workers to execute several of its features. There are web workers, aclk workers, dbengine
workers, health monitoring workers, libuv workers, and many more.

We managed to identify many deadlocks that slowed down the whole operation. We also
increased the number of workers to deliver more capacity on busy parents.

There is a new section for monitoring Netdata workers at the "Netdata Monitoring" section of the dashboard. Using this
monitoring, we are still working to make them even more efficient.

Deadlocks

The last release was hindered by rare deadlocks on very busy parents. These deadlocks are now gone, improving the agent's ability to centralize data from many children.

Dictionaries are now using Judy arrays

Judy arrays are probably the fastest and most CPU cache-friendly indexes available. Netdata already uses them for
dbengine and its page cache. Now all Netdata dictionaries are using them too, giving a performance boost to all
dictionary operations, including StatsD.

/proc collectors are now a lot faster

Initialization of /proc collectors was suboptimal, because they had to go through a slow process of adapting their read
buffers. We added a forward-looking algorithm to optimize this initialization, which now happens in 1/10th of the
time.

/proc/net/dev collector is now isolated

Some users have experienced gaps in /proc plugin charts. We identified that these gaps were triggered by the netdev module, which was causing the whole plugin to slow down and miss data collection iterations.

Now the netdev module of /proc plugin runs on its own thread to avoid this influencing the rest of the /proc
modules.

Internal Web Server optimizations

The internal web server of Netdata now spreads the work among its worker threads more evenly, utilizing as much of the
parallelism that is available to it.

Options in netdata.conf re-organized

We re-organized the [global] section of the netdata.conf, so that it is more meaningful for new users. The new
configurations are backward compatible. So, after you restart netdata with your old netdata.conf, grab the new one
from http://localhost:19999/netdata.conf to have the new format.

New MQTT Client - Tech Preview

We now have our own MQTT implementation within our ACLK protocol that will eventually replace the current MQTT-C client
for several reasons, including the following:

  • With the new MQTT implementation we now support MQTTv5 as our older implementation only supported MQTTv3
  • Reduce memory usage - no need for large fixed size buffers to be allocated all the time
  • Reduce memory copying - no need to copy message contents multiple times
  • Remove max message size limit
  • Remove issues where big messages are starving other messages

Currently, it’s provided as a tech preview, and it’s disabled by default. Feel free to have some fun with the new
implementation. This is how to enable it in netdata.conf:

[cloud]
    mqtt5 = yes

Acknowledgments

  • @JaphethLim for adding priority to Gotify notifications.
  • @MarianSavchuk for adding Alma and Rocky distros as CentOS compatibility distro in
    netdata-updater.
  • @aberaud for working on configurable storage engine.
  • @atriwidada for improving package dependency.
  • @coffeegrind123 for adding Gotify notification method.
  • @eltociear for fixing "GitHub" spelling in docs.
  • @fqx for adding tailscaled to apps_groups.conf.
  • @k0ste for updating net, aws, and ha groups in apps_groups.conf.
  • @kklionz for fixing a compilation warning.
  • @olivluca for fixing appending logs to the old log file after logrotate on Debian.
  • @petecooper for improving the usage message in netdata-installer.
  • @simon300000 for adding caddy to apps_groups.conf.

Contributions

Collectors

New

  • Add "UPS Load Usage" in Watts chart (charts.d/apcupsd) (#12965, @ilyam8)
  • Add Pressure Stall Information stall time charts (proc.plugin, cgroups.plugin) (#12869, @ilyam8)
  • Add "CPU Time Relative Share" chart when running inside a K8s cluster (cgroups.plugin) (#12741, @ilyam8)
  • Add a collector that parses the log files of the OpenVPN server (go.d/openvpn_status_log) (#675, @surajnpn)

Improvements

⚙️ Enhancing our collectors to collect all the data you need.

  • Add Tailscale apps_groups.conf (apps.plugin) (#13033, @fqx)
  • Skip collecting network interface speed and duplex if carrier is down (proc.plugin) (#13019, @vlvkobal)
  • Run the /net/dev module in a separate thread (proc.plugin) (#12996, @vlvkobal)
  • Add dictionary support to statsd (#12980, @ktsaou)
  • Add an option to filter the alarms (python.d/alarms) (#12972, @andrewm4894)
  • Update net, aws, and ha groups in apps_groups.conf (apps.plugin) (#12921, @k0ste)
  • Add k8s_cluster_name label to cgroup charts in K8s on GKE (cgroups.plugin) (#12858, @ilyam8)
  • Exclude Proxmox bridge interfaces (proc.plugin) (#12789, @ilyam8)
  • Add filtering by cgroups name and improve renaming in K8s (cgroups.plugin) (#12778, @ilyam8)
  • Execute the renaming script only for containers in K8s (cgroups.plugin) (#12747, @ilyam8)
  • Add k8s_qos_class label to cgroup charts in K8s (cgroups.plugin) (#12737, @ilyam8)
  • Reduce the CPU time required for cgroup-network-helper.sh (cgroups.plugin) (#12711, @ilyam8)
  • Add Proxmox VE processes to apps_groups.conf (apps.plugin) (#12704, @ilyam8)
  • Add Caddy to apps_groups.conf (apps.plugin) (#12678, @simon300000)

Bug fixes

🐞 Improving our collectors one bug fix at a time.

  • Fix adding wrong labels to cgroup charts (cgroups.plugin) (#13062, @ilyam8)
  • Fix cpu_guest chart context (apps.plugin) (#12983, @ilyam8)
  • Fix counting unique values in Sets (statsd.plugin) (#12963, @ktsaou)
  • Fix collecting data from uninitialized containers in K8s (cgroups.plugin) (#12912, @ilyam8)
  • Fix CPU-specific data in the "C-state residency time" chart dimensions (proc.plugin) (#12898, @vlvkobal)
  • Fix memory usage calculation by considering ZFS ARC as cache on FreeBSD (freebsd.plugin) (#12879, @vlvkobal)
  • Fix disabling K8s pod/container cgroups when fail to rename them (cgroups.plugin) (#12865, @ilyam8)
  • Fix memory usage calculation by considering ZFS ARC as cache on Linux (proc.plugin) (#12847, @ilyam8)
  • Fix adding network interfaces when the cgroup proc is in the host network namespace (cgroups.plugin) (#12788, @ilyam8)
  • Fix not setting chart units (go.d/snmp) (#682, @ilyam8)
  • Fix not collecting Integer type values (go.d/snmp) (#680, @surajnpn)

eBPF

Health

Streaming

  • Improve failover logic when the Agent is configured to stream to multiple destinations (#12866, @MrZammler)
  • Increase the default "buffer size bytes" to 10MB (#12913, @ilyam8)

Exporting

  • Add the URL query parameter that filters charts from the /allmetrics API query (#12820, @vlvkobal)
  • Make the "send charts matching" option behave the same as the "filter" URL query parameter for prometheus format (#12832, @ilyam8)
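
As a small illustration of the new query parameter, the sketch below builds an /allmetrics URL with a chart filter. The helper name and its defaults are assumptions; the host and port follow the localhost:19999 address used elsewhere in these notes, and the filter value is assumed to use the same pattern syntax as the "send charts matching" option.

```python
from urllib.parse import urlencode


def allmetrics_url(
    base: str = "http://localhost:19999",
    fmt: str = "prometheus",
    chart_filter: str = "",
) -> str:
    """Build an /api/v1/allmetrics URL, optionally restricted to some charts."""
    params = {"format": fmt}
    if chart_filter:
        # Only charts matching this pattern are included in the response.
        params["filter"] = chart_filter
    return f"{base}/api/v1/allmetrics?{urlencode(params)}"
```

`urlencode` percent-encodes characters such as `*` and spaces, so a pattern can be passed as-is without manual escaping.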

Documentation

📄 Keeping our documentation healthy together with our awesome community.

Packaging / Installation

📦 "Handle with care" - Just like handling physical packages, we put in a lot of care and effort to publish beautiful
software packages.

  • Add Alma Linux 9 and RHEL 9 support to CI and packaging (#13058, @Ferroin)
  • Fix handling of temp directory in kickstart when uninstalling (#13056, @Ferroin)
  • Only try to update repo metadata in updater script if needed (#13009, @Ferroin)
  • Use printf instead of echo for printing collected warnings in kickstart (#13002, @Ferroin)
  • Don't kill Netdata PIDs if successfully stopped Netdata in installer/uninstaller (#12982, @ilyam8)
  • Properly handle the case when 'tput colors' does not return a number in kickstart (#12979, @ilyam8)
  • Update libbpf version to v0.8.0 (#12945, @thiagoftsm)
  • Update default fping version to 5.1 (#12930, @ilyam8)
  • Update go.d.plugin version to v0.32.3 (#12862, @ilyam8)
  • Autodetect channel for specific version in kickstart (#12856, @maneamarius)
  • Fix "Bad file descriptor" error in netdata-uninstaller (#12828, @maneamarius)
  • Add support for installing static builds on systems without usable internet connections (#12809, @Ferroin)
  • Add --repositories-only option to kickstart (#12806, @maneamarius)
  • Rename --install option for kickstart.sh (#12798, @maneamarius)
  • Fix to avoid recompiling protobuf all the time (#12790, @ktsaou)
  • Fix non-interpreted new lines when printing deferred errors in netdata-installer (#12786, @ilyam8)
  • Fix a typo in the warning() function in netdata-installer (#12781, @ilyam8)
  • Fix checking of environment file in netdata-updater (#12768, @Ferroin)
  • Add a missing function and Alma and Rocky distros as CentOS compatibility distro to netdata-updater (#12757, @MarianSavchuk)
  • Improve the usage message in netdata-installer (#12755, @petecooper)
  • Make atomics a hard-dependency (#12730, @vkalintiris)
  • Add --install-version flag for installing specific Netdata version to kickstart (#12729, @maneamarius)
  • Correctly propagate errors and warnings up to the kickstart script from scripts it calls (#12686, @Ferroin)
  • Fix not-respecting of NETDATA_LISTENER_PORT in docker healthcheck (#12676, @ilyam8)
  • Add options to kickstart for explicitly passing options to installer code (#12658, @Ferroin)
  • Improve handling of release channel selection in kickstart (#12635, @Ferroin)
  • Treat auto-updates as a tristate internally in the kickstart script (#12634, @Ferroin)
  • Include proper package dependency (#12518, @atriwidada)
  • Fix appending logs to the old log file after logrotate on Debian (#9377, @olivluca)

Other Notable Changes

Improvements

⚙️ Greasing the gears to smoothen your experience with Netdata.

  • Add hostname to mirrored hosts in the /api/v1/info endpoint (#13030, @ktsaou)
  • Optimize query engine queries (#12988, @ktsaou)
  • Optimize query engine and cleanup (#12978, @ktsaou)
  • Improve the web server work distribution across worker threads (#12975, @ktsaou)
  • Check link local address before querying cloud instance metadata (#12973, @ilyam8)
  • Speed up query engine by refactoring rrdeng_load_metric_next() (#12966, @ktsaou)
  • Optimize the dimensions option store to the metadata database (#12952, @stelfrag)
  • Add detailed dbengine stats (#12948, @ktsaou)
  • Stream Metric Correlation version to parent and advertise Metric Correlation status to the Cloud (#12940, @MrZammler)
  • Move directories, logs, and environment variables configuration options to separate sections (#12935, @ilyam8)
  • Adjust the dimension liveness status check (#12933, @stelfrag)
  • Make sqlite PRAGMAs user configurable (#12917, @ktsaou)
  • Add worker jobs for cgroup-rename, cgroup-network and cgroup-first-time (#12910, @ktsaou)
  • Return stable or nightly based on version if the file check fails (#12894, @stelfrag)
  • Take into account the in queue wait time when executing a data query (#12885, @stelfrag)
  • Add fixes and improvements to workers library (#12863, @ktsaou)
  • Pause alert pushes to the cloud (#12852, @MrZammler)
  • Allow to use the new MQTT 5 implementation (#12838, @underhood)
  • Set a page wait timeout and retry count (#12836, @stelfrag)
  • Allow external plugins to create chart labels (#12834, @ilyam8)
  • Reduce the number of messages written in the error log due to out of bound timestamps (#12829, @stelfrag)
  • Cleanup the node instance table on startup (#12825, @stelfrag)
  • Accept a data query timeout parameter from the cloud (#12823, @stelfrag)
  • Write the entire request with parameters in the access.log file (#12815, @stelfrag)
  • Add a parameter for how many worker threads the libuv library needs to pre-initialize (#12814, @stelfrag)
  • Optimize linking of foreach alarms to dimensions (#12813, @vkalintiris)
  • Add a hyphen to the list of available characters for chart names (#12812, @ilyam8)
  • Speed up queries by providing optimization in the main loop (#12811, @ktsaou)
  • Add workers utilization charts for Netdata components (#12807, @ktsaou)
  • Fill missing removed events after a crash (#12803, @MrZammler)
  • Speed up buffer increases (minimize reallocs) (#12792, @ktsaou)
  • Speed up reading big proc files (#12791, @ktsaou)
  • Make dbengine page cache undumpable and dedupable (#12765, @ilyam8)
  • Speed up execution of external programs (#12759, @ktsaou)
  • Remove per chart configuration (#12728, @vkalintiris)
  • Check for chart obsoletion on children re-connections (#12707, @MrZammler)
  • Add a 2 minute timeout to stream receiver socket (#12673, @MrZammler)
  • Improve Agent cloud chart synchronization (#12655, @stelfrag)
  • Add the ability to perform a data query using an offline node id (#12650, @stelfrag)
  • Implement ks_2samp test for Metric Correlations (#12582, @MrZammler)
  • Reduce alert events sent to the cloud (#12544, @MrZammler)
  • Store alert log entries even if the alert is repeating (#12226, @MrZammler)
  • Improve storage number unpacking by using a lookup table (#11048, @vkalintiris)

Bug fixes

🐞 Increasing Netdata's reliability one bug fix at a time.

Code organization

🏋️ Changes to keep our code base in good shape.

Deprecation notice

The following items will be removed in our next minor release (v1.36.0):

Patch releases (if any) will not be affected.

| Component | Type | Will be replaced by |
|---|---|---|
| python.d/chrony | collector | go.d/chrony |
| python.d/ovpn_status_log | collector | go.d/openvpn_status_log |

All the deprecated components will be moved to the netdata/community repository.

Deprecated in this release

In accordance with our previous deprecation notice, the following items have been removed in this release:

| Component | Type | Replaced by |
|---|---|---|
| node.d | plugin | - |
| node.d/snmp | collector | go.d/snmp |
| python.d/apache | collector | go.d/apache |
| python.d/couchdb | collector | go.d/couchdb |
| python.d/dns_query_time | collector | go.d/dnsquery |
| python.d/dnsdist | collector | go.d/dnsdist |
| python.d/elasticsearch | collector | go.d/elasticsearch |
| python.d/energid | collector | go.d/energid |
| python.d/freeradius | collector | go.d/freeradius |
| python.d/httpcheck | collector | go.d/httpcheck |
| python.d/isc_dhcpd | collector | go.d/isc_dhcpd |
| python.d/mysql | collector | go.d/mysql |
| python.d/nginx | collector | go.d/nginx |
| python.d/phpfpm | collector | go.d/phpfpm |
| python.d/portcheck | collector | go.d/portcheck |
| python.d/powerdns | collector | go.d/powerdns |
| python.d/redis | collector | go.d/redis |
| python.d/web_log | collector | go.d/weblog |

Platform Support Changes

This release adds official support for the following platforms:

  • RHEL 9.x, Alma Linux 9.x, and other compatible RHEL 9.x derived platforms
  • Alpine Linux 3.16

This release removes official support for the following platforms:

  • Fedora 34 (support ended due to upstream EOL).
  • Alpine Linux 3.12 (support ended due to upstream EOL).

This release includes the following additional platform support changes.

  • We’ve switched from Alpine 3.15 to Alpine 3.16 as the base for our Docker images and static builds. This should not
    require any action on the part of users, and simply represents a version bump to the tooling included in our Docker
    images and static builds.
  • We’ve switched from Rocky Linux to Alma Linux as our build and test platform for RHEL compatible systems. This will
    enable us to provide better long-term support for such platforms, as well as opening the possibility of better support
    for non-x86 systems.

Netdata Agent Release Meetup

Join the Netdata team on the 9th of June at 5pm UTC for the Netdata Agent Release Meetup, which will be held on
the Netdata Discord.

Together we’ll cover:

  • Release Highlights
  • Acknowledgements
  • Q&A with the community

RSVP now - we look forward to meeting you.

Support options

As we grow, we stay committed to providing the best support ever seen from an open-source solution. Should you encounter
an issue with any of the changes made in this release or any feature in the Netdata Agent, feel free to contact us
through one of the following channels:

  • Netdata Learn: Find documentation, guides, and reference material for monitoring and
    troubleshooting your systems with Netdata.
  • GitHub Issues: Make use of the Netdata repository to report bugs or open
    a new feature request.
  • GitHub Discussions: Join the conversation around the Netdata
    development process and be a part of it.
  • Community Forums: Visit the Community Forums and contribute to the collaborative
    knowledge base.
  • Discord: Jump into the Netdata Discord and hangout with like-minded sysadmins,
    DevOps, SREs and other troubleshooters. More than 1100 engineers are already using it!
netdata - v1.32.0

Published by Ferroin almost 3 years ago

Release v1.32.0

The newest version of Netdata, v1.32.0, propels us toward the end of the year, and the Netdata community is positioned to grow stronger than ever in 2022. Before we get into specifics of the new release, it's worth reflecting on that growth.

Netdata open-source Agent growth

The open-source Netdata Agent, the best OSS node monitoring and troubleshooting ever, currently has:

  • 1,000,000 unique Netdata nodes live!
  • 330,000 engineers using the agent per month!
  • Our open-source community growing at an amazing rate, with 3,000 new nodes and 8,000 users per day!
  • 250,000 Docker pulls per day with 360 million total, according to DockerHub!

Netdata Cloud growth

The Netdata Cloud, our infrastructure-level, distributed, real-time monitoring and troubleshooting orchestrator, is also showing similar growth, with:

  • 35,000 live Netdata nodes!
  • 90,000 engineers signed up with 200 new sign-ups every day!
  • 180 new spaces created every day!

We are not just pleased with this amazing adoption rate, we are inspired by it. It is you users who give us the energy and confidence to move forward into a new era of high-fidelity, real-time monitoring and troubleshooting, made accessible to everyone!

Thank you for the inspiration! You rock!

Community News

As many of you know, even though we are not endorsed by CNCF, Netdata is the fourth most starred project in the CNCF landscape. We want to thank you for this expression of your appreciation. If you love Netdata and haven't yet, consider giving us a GitHub star.

Additionally, we invite you to join us on our new Discord server to continue our growth and trajectory, but also to join in on fun and informative live conversations with our wonderful community.

v1.32.0 at a glance

The following offers a high-level overview of some of the key changes made in this release, with more detailed description available in subsequent sections.

New Cloud backend and Agent communication protocol
This Agent release supports our new Cloud backend. From here, we will be offering much faster and simpler communication, reliable alerts and exchange of metadata, and first-time support for the parent-child relationship of Netdata agents. This is the first Agent release that allows Netdata Cloud to use the Netdata Agent as a distributed time-series database that supports replication and query routing, for every metric!

eBPF latency monitoring, container monitoring, and more
We use eBPF to monitor all running processes, without the cooperation of the processes and without sniffing data traffic. This new release includes 13 new eBPF monitoring features, including I/O latency, BTRFS, EXT4, NFS, XFS and ZFS latencies, IRQ latencies, extended swap monitoring, and more.

Machine learning (ML) powered anomaly detection
This release links Netdata Agent with dlib, the popular C++ machine learning algorithms library, which we use to automatically detect anomalies out-of-the-box, at the edge! Once enabled, Netdata trains an ML model for every metric, which is then used to detect outliers in real-time. The resulting "anomaly bit" (where 0=normal, 1=anomalous) associated with each database entry is stored alongside the raw metric value with zero additional storage overhead! This feature is still in development, so it is disabled by default. If you would like to test it and provide feedback, you can enable the feature using the instructions provided in the Detailed release highlights section.

New timezone selector and time controls in the user interface
We implemented a new timezone picker and time controls to enhance administrative abilities in the dashboard.

Docker image POWER8+ support
Netdata Docker images now support recent IBM Power Systems, Raptor Talos II, and more.

And more...
Four new collectors, 112 total improvements, 95 bug fixes, 49 documentation updates, and 57 packaging and installation changes!

Detailed release highlights

New Cloud backend and Agent communication protocol

It's no secret that the best of Netdata Cloud is yet to come. After several months of developing, testing, and benchmarking a new architectural system, we have steadied ourselves for that growth. These changes should offer notable and immediate improvements in reliability and stability, but more importantly, they allow us to quickly and efficiently develop new features and enhanced functionality. Here's what you can look for on the short-term horizon, thanks to our new architecture:

  • Greater capacity: The new architecture will change the communication protocol between the Agent and the Cloud to be incremental, improving our agent-handling capacity by ensuring that the Cloud uses measurably less bandwidth.
  • Parent/child relationships: The new architecture will allow, for the first time, the recognition of parent-child relationships in the Cloud. These changes will enable you to change storage configuration on parents, limit sent metrics, and reduce data frequency to achieve longer data retention for your nodes. On top of this, we will continue to develop the ability for you to have complex setups to scale your monitoring with parents as proxies. Ultimately, this will enable Netdata to operate as a headless connector with the lowest footprint possible on your production nodes.
  • Alerts: The new architecture will host a multitude of improvements on our alerts presentation over the coming months, allowing for enhanced reliability, alert management, alert logs to be collected in the Cloud, and more.

If you would like to be among the first to test this new architecture and provide feedback, first make sure that you have installed the latest Netdata version following our guide. Then, follow our instructions for enabling the new architecture.

eBPF container monitoring

We did a lot of work to enhance our eBPF container monitoring in this release. First, we started with the development of full eBPF support for cgroups. As a refresher on just how important this update is: cgroups together with namespaces are the building blocks for containers, which are the dominant way of distributing monitoring applications. We use cgroups to control how much of a given key resource (CPU, memory, network, and disk I/O) can be accessed or used by a process or set of processes. Our eBPF collector now creates charts for each cgroup, which enables us to understand how a specific cgroup interacts with the Linux kernel! 🤓

This enhances our already extensive monitoring by including cgroups for mem, process, network, file access, and more.

eBPF latency monitoring

By enabling eBPF monitoring on all systems that support it, Netdata has already been established as a world-leading distributor of eBPF! We use eBPF to monitor all running processes, without the cooperation of the processes, by tracking any way the application interfaces with the system. And in this release, we continue our commitment to further improve eBPF by tracking latencies by disks, IRQs, etc.

Our new eBPF latency features include:

  • A new set of Disk I/O latency charts, which monitor the time that it takes for an I/O request to complete. As many of you may know, this is the most important metric for storage performance!
  • IRQ latency monitoring, which shows the time spent servicing interrupts (hard or soft).
  • A new Filesystem submenu that adds latency monitoring for different filesystems: BTRFS, Ext4, NFS, XFS and ZFS. Latency is monitored for the most common functions, such as each open request and each sync request.

eBPF is a very strong addition to our monitoring tools, and we are committed to providing the best experience of monitoring with eBPF from a distance, without disrupting the data flow!

Other eBPF enhancements

But we didn't stop there with eBPF in v1.32.0. We also provided the following updates:

  • We moved VFS to a Filesystem menu to simplify the visualization of events realized by filesystems. This allows you to monitor actions of filesystems and their latency.
  • Until now, Netdata had metrics that showed the amount of swap in use. eBPF.plugin now extends swap monitoring to show how a specific application group/cgroup performs actions on swap.
  • We have improved process management monitoring by adding monitoring to shared memory and using tracepoints to monitor process creation and exit with more accuracy.
  • Netdata also brings monitoring of OOM Kill events for each apps group defined on the host.

If you share our interest in eBPF monitoring, or have questions or requests, feel free to drop by our Community forum to start a discussion with us.

Machine learning (ML) powered anomaly detection

Machine learning (ML) is undeniably a wave of the future in monitoring and troubleshooting. The Netdata community is riding that wave forward together, ahead of everyone else. Netdata v1.32.0 introduces some foundational capabilities for ML-driven anomaly detection in the agent. We have integrated the popular dlib C++ ML library to power unsupervised anomaly detection out-of-the-box.

While this functionality is still under development and subject to change, we want to develop this with you, as a team. The functionality is disabled by default while we dogfood the feature internally and build additional ML-leveraging features into Netdata Cloud. But you can go to the new [ml] section in netdata.conf and set enabled=yes to turn on anomaly detection. After restarting Netdata, you should see the Anomaly Detection menu with charts highlighting the overall number and percent of anomalous metrics on your node. This can be a very useful single number summary of the state of your node.
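To make the "percent of anomalous metrics" summary concrete, here is a minimal sketch of how such a number can be derived from per-metric anomaly bits (0=normal, 1=anomalous). It is an illustration only, not Netdata's implementation: the function name anomaly_rate and the metric names in the sample are hypothetical.

```python
def anomaly_rate(bits):
    """Return the percentage of metrics whose latest anomaly bit is 1."""
    bits = list(bits)
    if not bits:
        # No metrics collected yet: report nothing anomalous.
        return 0.0
    return 100.0 * sum(bits) / len(bits)


# Hypothetical snapshot: the latest anomaly bit of four metrics on one node.
latest_bits = {
    "system.cpu.user": 0,
    "system.ram.used": 1,
    "disk.sda.reads": 0,
    "net.eth0.received": 0,
}

print(anomaly_rate(latest_bits.values()))  # 25.0
```

A single percentage like this is cheap to compute on every collection cycle, which is why it works well as a one-number summary of a node.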

Share your feedback by emailing us at [email protected] or just come hang out in the 🤖-ml-powered-monitoring channel of our Discord, where we discuss all things ML and more!

And then, be on the lookout for some bigger announcements and launches relating to ML over the next couple of months.

New timezone selector and time controls in the user interface

Collaborating in a remote world across regions can be difficult, so we wanted to make it easier for you to sync with your administrative teams and your system information. Our new timezone selector allows you to select a timezone to accommodate collaboration needs within your teams and infrastructure. Additionally, we have added the following time controls to allow you to distinguish if the content you are looking at is live or historical and to refresh the content of the page when the tabs are in the background:

  • Play: When this option is selected, the content of the page is refreshed automatically while the tab is in the foreground.
  • Pause: When this option is selected, the content of the page is not refreshed, either because you manually paused it or, for example, because you are investigating data on a chart (the cursor is on top of a chart).
  • Force Play: When this option is selected, the content of the page is refreshed automatically even if the tab is in the background.

Docker image POWER8+ support

And on top of all of that, we have added 64-bit little-endian POWER8+ support to our official Docker images, allowing the use of Netdata Docker images on recent IBM Power Systems, Raptor Talos II, and similar POWER-based hardware. This extends the list of platforms already supported by our Docker images, which includes:

  • 32 and 64 bit x86
  • ARMv7
  • AArch64

Acknowledgments

  • @nabijaczleweli for fixing writing updater log under root.
  • @MikaelUrankar for fixing calculation of sysctl mib size in freebsd plugin.
  • @filip-plata for adding additional metrics to python.d/postgres collector.
  • @eltociear for fixing typos.
  • @gotjoshua for adding a link to python.d/httpcheck.conf.
  • @wangpei-nice for fixing ebpf.plugin segfault when ebpf_load_program returns null pointer.
  • @zanechua for adding Microsoft Teams to supported notification endpoints.
  • @diizzyy for adding support for Intel 2.5G and Synopsys DesignWare nic driver in freebsd plugin.
  • @Saruspete for fixing handling of adding slabs after discovery in slabinfo plugin.
  • @mjtice for adding autovacuum and tx wraparound charts to python.d/postgres.
  • @charoleizer for adding PostgreSQL version to requirements section.
  • @danmichaelo for fixing a typo in exporting docs.
  • @oldgiova for adding capsh check before issuing setcap cap_perfmon.
  • @oldgiova for adding Travis ctrl file for checking if changes happened.
  • @0x3333 for fixing an inconsistent status check in charts.d/apcupsd.
  • @etienne-napoleone for adding terra related binaries to blockchains apps plugin group.
  • @anayrat for fixing postgres replication_slot chart on standby.
  • @vpiserchia for fixing handling of null values returned by _cat/indices API in python.d/elasticsearch.
  • @elelayan for fixing zpool state parsing in proc/zfs.
  • @steffenweber for adding missing privilege to fix MySQL slave reporting.
  • @unhandled-exception for adding sorting of the list of databases in alphabetical order in python.d/postgres.
  • @78Star for updating Netdata and its dependencies versions for pfSense.
  • @unhandled-exception for fixing crashing of the wal query if wal-file was removed concurrently in python.d/postgres.
  • @rupokify for updating jQuery dependency.
  • @caleno for fixing a typo in streaming docs.
  • @rex4539 for fixing typos.

Collectors

Improvements

  • Add AWS to apps_groups.conf (#11826, @ilyam8)
  • Show stats for systemd protected mount points (diskspace plugin) (#11767, @vlvkobal)
  • Add support for v1.7.0+ (go.d/coredns) (#619, @georgeok)
  • Add "/basic_status" job nginx.conf (go.d/nginx) (#612, @ilyam8)
  • Add sharding metrics (go.d/mongodb) (#609, @georgeok)
  • Add thread operations metrics (go.d/mysql) (#607, @ilyam8)
  • Add replica sets metrics (go.d/mongodb) (#604, @georgeok)
  • Add databases metrics (go.d/mongodb) (#602, @georgeok)
  • Add more OS(OperatingSystem) charts (go.d/wmi) (#593, @ilyam8)
  • Add caddy job to prometheus.conf (go.d/prometheus) (#581, @odyslam)
  • Add AOF file size metrics (go.d/redis) (#578, @ilyam8)
  • Add openethereum/geth jobs to prometheus.conf (go.d/prometheus) (#578, @odyslam)
  • Update whois/whois-parser packages and add timeout configuration option (go.d/whoisquery) (#576, @ilyam8)
  • Disable reporting min/avg/max group uptime by default (apps plugin) (#11609, @ilyam8)
  • Add sorting of the list of databases in alphabetical order (python.d/postgres) (#11580, @unhandled-exception)
  • Add terra related binaries to blockchains group (apps plugin) (#11437, @etienne-napoleone)
  • Add instruction per cycle charts (perf plugin) (#11392, @thiagoftsm)
  • Add autovacuum and tx wraparound charts (python.d/postgres) (#11267, @mjtice)
  • Add support for Intel 2.5G and Synopsys DesignWare nic driver (freebsd plugin) (#11251, @diizzyy)
  • Add web3 and blockchains groups (apps plugin) (#11220, @odyslam)
  • Implement merging user/stock configuration files (python.d plugin) (#11217, @ilyam8)
  • Rename default job from 'local' to 'anomalies' (python.d/anomalies) (#11178, @andrewm4894)
  • Add standby lag and blocking transactions charts (python.d/postgres) (#11169, @filip-plata)

Bug fixes

  • Fix renaming for cgroups with dots in the path (cgroups plugin) (#11775, @vlvkobal)
  • Fix exiting on SIGPIPE (go.d plugin) (#630, @ilyam8)
  • Fix domain syntax validation (go.d/whoisquery) (#629, @ilyam8)
  • Fix missing NONE in valid request methods (go.d/squidlog) (#621, @ilyam8)
  • Remove wrong "queue_messages_in_queues" chart (go.d/vernemq) (#601, @ilyam8)
  • Fix HTTP/socket client initialization order (go.d/phpfpm) (#591, @ilyam8)
  • Fix scraping metrics when resources are not discovered (go.d/vsphere) (#589, @ilyam8)
  • Fix LTSV log format parsing (go.d/weblog) (#584, @ilyam8)
  • Fix expiration date parsing (go.d/whoisquery) (#575, @ilyam8)
  • Fix containers name resolution for crio/containerd runtime (cgroups plugin) (#11756, @ilyam8)
  • Add sensors to charts.d.conf and add a note on how to enable it (charts.d plugin) (#11715, @ilyam8)
  • Fix crashing of the wal query if wal-file was removed concurrently (python.d/postgres) (#11697, @unhandled-exception)
  • Fix "lsns: unknown column" logging (cgroups plugin) (#11687, @ilyam8)
  • Fix nfsd RPC metrics and remove unused nfsd charts and metrics (proc/nfsd) (#11632, @vlvkobal)
  • Fix "proc4ops" chart family (proc/nfsd) (#11623, @ilyam8)
  • Fix swap size calculation (cgroups plugin) (#11617, @vlvkobal)
  • Fix RSS memory counter for systemd services (cgroups plugin) (#11616, @vlvkobal)
  • Fix VBE parsing (python.d/varnish) (#11596, @ilyam8)
  • Remove unused synproxy chart (proc/synproxy) (#11582, @vlvkobal)
  • Fix zpool state parsing (proc/zfs) (#11545, @elelayan)
  • Fix null values returned by '_cat/indices' API (python.d/elasticsearch) (#11501, @vpiserchia)
  • Fix replication_slot chart on standby (python.d/postgres) (#11455, @anayrat)
  • Fix an inconsistent status check (charts.d/apcupsd) (#11435, @0x3333)
  • Fix plugin name (stats.d plugin) (#11400, @vlvkobal)
  • Fix plugin names (freebsd and macos plugins) (#11398, @vlvkobal)
  • Fix lack of "module" in chart definition (all chart.d modules) (#11390, @ilyam8)
  • Fix various python modules charts contexts (python.d/smartd_log, mysql, zscores) (#11310, @ilyam8)
  • Fix current operation charts title and context (proc/mdstat) (#11289, @ilyam8)
  • Fix handling of adding slabs after discovery (slabinfo plugin) (#11257, @Saruspete)
  • Fix calculation of sysctl mib size (freebsd plugin) (#11159, @MikaelUrankar)

Deprecation notice

An upcoming stable release of the Netdata agent will include a maintainability update to our base Docker image.
A small percentage of users will find that all self-compiled packages must be manually rebuilt after the update, even if relocation/SONAME errors are not encountered. --security-opt=seccomp=unconfined can be passed with no default.json, but this introduces security vulnerabilities between the host and malicious code in the container.

Alternatively, users can prepare for the update by upgrading to one of the following:

  • runc v1.0.0-rc93
  • Docker 19.03.9 or greater AND libseccomp 2.4.2 or greater

While Netdata previously avoided making this update to minimize inconvenience to our users, we are now facing a third-party end-of-life date, and we believe the minimal number of affected users substantiates the need for the change.

Additionally, in a future stable release, we will be removing our legacy agent-to-cloud connection. Most users should see no change in this upgrade, but we will lose SOCKS 5 proxy support for the Netdata Cloud functionality, which will affect a small number of users.

Support options

As we grow, we stay committed to providing the best support ever seen from an open-source solution. Should you encounter an issue with any of the changes made in this release or any feature in the Netdata agent, feel free to contact us by one of the following channels:

  • GitHub: You can use our GitHub repo to report bugs and submit feature requests.
  • Community forum: You can visit our community forum for questions and training.
  • NEW: Discord: You can jump into our Discord for interactive, synchronous help and discussion. More than 700 engineers are already using it! Join us!

netdata - v1.10.0

Published by firehol-automation over 6 years ago

New to netdata? Check its demo: https://my-netdata.io

Posted on twitter, facebook, reddit r/linux,


Hi all,

Another great netdata release: netdata v1.10.0 !

This is a birthday release: netdata is now 2 years old !

Many thanks to all the contributors who help build, enhance and improve a project that is useful and helpful for thousands of admins, devops and developers around the world! You rock!

- @ktsaou

At a glance

netdata now has a new web server (called static) with a fixed number of threads, providing a lot better performance and finer control of the resources allocated to it.

All dashboard elements (javascript) have been updated to their latest versions - this allows a smoother experience when embedding netdata charts on third party web sites and apps.


IMPORTANT: all users using older netdata are advised to update to this version. This version offers improved stability, security and a huge number of bug fixes, compared to any prior version of netdata.


new plugins

  • BTRFS - monitor the allocations of BTRFS filesystems (yes, netdata can now properly detect when btrfs is running out of space)
  • BCACHE - monitor the caching block layer that allows building hybrid disks using normal HDDs and SSDs
  • Ceph - monitor ceph distributed storage
  • nginx plus - monitor the nginx+ web servers
  • libreswan - monitor IPSEC tunnels
  • Traefik - monitor traefik reverse proxies
  • icecast - monitor icecast streaming servers
  • ntpd - monitor NTP servers
  • httpcheck - monitor any remote web server
  • portcheck - monitor any remote TCP port
  • spring-boot - monitor java spring boot applications
  • dnsdist - monitor dnsdist name servers
  • hugepages - monitor the allocation of Linux hugepages

enhanced / improved plugins

  • statsd
  • web_log
  • containers monitoring
  • system memory
  • diskspace
  • network interfaces
  • postgres
  • rabbitmq
  • apps.plugin
  • haproxy
  • uptime
  • ksm
  • mdstat
  • elasticsearch
  • apcupsd
  • isc-dhcpd
  • fronius
  • stiebeleltron

new alarm notifications methods

  • alerta
  • IRC

And as always, hundreds more enhancements, improvements and bugfixes.


BTRFS monitoring

BTRFS space usage monitoring and related alarms.

netdata is able to detect if any of the space-related components (physical disk allocation, data, metadata and system) of BTRFS is about to become exhausted!

#3150 - thanks to @Ferroin for explaining everything about btrfs...

bcache monitoring

netdata now monitors bcache metrics - they are automatically added to any disk that is found to be a bcache disk.

ceph monitoring

New plugin to monitor ceph, the unified, distributed storage system designed for excellent performance, reliability and scalability (#3166 @lets00).

containers and VMs monitoring

  • netdata now monitors systemd-nspawn containers.
  • netdata now renames charts of kubernetes containers.
  • virsh is now called with -r to avoid prompting for password #3144
  • cgroup-network is now a lot more strict, preventing unauthorized privilege escalation #3269
  • cgroup-network now searches for container processes in sub-cgroups too - this improves the mapping of network interfaces to containers
  • cgroup-network now works even when there are no veth interfaces in the system

monitor ntpd

netdata can now monitor isc-ntpd. @rda0 did a marvelous job decoding NTP Control Message Protocol, collecting ntpd metrics in the most efficient way #3421, #3454 @rda0

btw, netdata also monitors chrony but the chrony module of netdata is disabled by default, because certain CentOS versions ship a version of chrony that consumes 100% cpu when queried for statistics.

nginx plus web servers monitoring

Added python plugin to monitor the operation of nginx plus servers. The plugin monitors everything about nginx+, except streaming #3312 @l2isbad

libreswan IPSEC tunnels monitoring

netdata now monitors libreswan tunnels - #3204

remote HTTP/HTTPS server monitoring

netdata now has an httpcheck plugin (module of python.d.plugin) that can query remote http/https servers, track the response timings and check that the response body contains certain text #3448 @ccremer.

remote TCP port monitoring

netdata now has a portcheck plugin (module of python.d.plugin) that can check whether any remote TCP port is open #3447 @ccremer
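The idea behind such a check can be sketched in a few lines. This is not the portcheck module's code; it is a minimal illustration using Python's standard socket library, and the function name port_is_open and the 1 second default timeout are assumptions made for the example.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake; the context manager
        # closes the socket again as soon as the check is done.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling port_is_open("127.0.0.1", 19999), for example, would report whether something is listening on that port of the local machine. A real collector would also record how long the connection took and distinguish a refused connection from a timeout.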

icecast streaming server monitoring

netdata now monitors icecast servers #3511 @l2isbad.

traefik reverse proxy monitoring

netdata now monitors traefik reverse proxies - #3557.

spring-boot monitoring

netdata can now monitor java spring-boot applications @Wing924

dnsdist

netdata now monitors dnsdist name servers - @nobody-nobody #3009

statsd

  • statsd dimensions now support the options that external plugin dimensions support (currently the only usable option is hidden, which adds the dimension but hides it on the dashboard - a hidden dimension can still participate in various calculations, including alarms).
  • statsd now reports the CPU usage of its threads at the netdata section.
  • statsd metrics are logged to access.log the first time they are encountered.
  • statsd metrics now accept the special value zinit, which allows them to be initialized without altering their values (this is useful if you have rare metrics that you need to initialize when netdata starts).
  • statsd over TCP is now a lot faster - netdata can process up to 3.5mil statsd metrics / second using just one core. Added options to control the timeouts of TCP statsd connections.
  • fixed the title and context of statsd private charts
  • statsd private charts can now be hidden from the dashboard #3467
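As a rough illustration of how an application might feed these metrics, the sketch below formats statsd lines and sends them in one UDP datagram. Several things here are assumptions rather than facts from these notes: the metric names are hypothetical, port 8125 is the conventional statsd port, and zinit is assumed to take the place of the numeric value in the usual name:value|type line.

```python
import socket


def statsd_line(name, value, kind):
    """Format one statsd metric as name:value|type."""
    return f"{name}:{value}|{kind}"


def send_statsd(lines, host="127.0.0.1", port=8125):
    """Send several metrics in a single UDP datagram, one metric per line."""
    payload = "\n".join(lines).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metrics: initialize a rarely updated counter (assumed zinit
# syntax), then count one request.
lines = [
    statsd_line("myapp.rare_errors", "zinit", "c"),
    statsd_line("myapp.requests", 1, "c"),
]
print(lines)
```

Passing lines to send_statsd would deliver both metrics to a local statsd listener. UDP is used here only because it keeps the sketch short.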

postgres

Several new charts have been added to monitor (#3400 by @anayrat):

  1. checkpointer charts
  2. bgwriter charts
  3. autovacuum charts
  4. replication delta charts
  5. WAL archive charts
  6. WAL charts
  7. temporary files charts

The postgres plugin now also works when postgres is in recovery mode.

rabbitmq

  • added Erlang run queue chart. This is useful in conjunction with the existing Erlang processes chart to get a better overall idea of what's going on in the Erlang VM. @arch273
  • added rabbitmq information on the dashboard to complement the charts.

apps.plugin

Prior to this version, netdata detected the user and group of processes by examining the ownership of /proc/PID/stat. Unfortunately, it seems that the ownership of files in /proc does not change when the process switches user. So, netdata could not detect the user and group of processes that started as root and then switched to another user.

Now netdata reads /proc/PID/status:

  • process ownership information is now accurate
  • eliminated the need to read /proc/PID/statm (all the information of /proc/PID/statm is available in /proc/PID/status)
  • allowed netdata to read VmSwap, so a new chart has been added to monitor the swap memory usage per process, user and group.
  • fixed issue with unreasonable spikes on processes cpu on FreeBSD (there was a typo) #3245
  • fixed issue with errors reported on FreeBSD about pid 0 #3099

The new plugin is 20% more expensive in terms of CPU. We tried hard to optimize it, but this is as good as it can get. Read about it at #3434 and #3436
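For readers unfamiliar with the file, the sketch below shows what reading these fields from /proc/PID/status looks like. It is not the apps.plugin code (which is written in C); the function name parse_proc_status is hypothetical, and picking the effective ID from the four values on the Uid and Gid lines is an assumption made for the example.

```python
def parse_proc_status(text):
    """Extract ownership and swap usage from /proc/<pid>/status content."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        fields = value.split()
        if key in ("Uid", "Gid") and len(fields) >= 2:
            # The four values are: real, effective, saved set, filesystem.
            info[key.lower()] = int(fields[1])
        elif key == "VmSwap" and fields:
            # Reported in kB; kernel threads have no VmSwap line at all.
            info["vmswap_kb"] = int(fields[0])
    return info


# Sample content for a process that started as root and switched to uid 33.
sample = "Name:\tnginx\nUid:\t0\t33\t33\t33\nGid:\t0\t33\t33\t33\nVmSwap:\t     128 kB\n"
print(parse_proc_status(sample))
```

On a Linux system the same function can be applied to the text of /proc/self/status.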

haproxy

Added charts:

  • hrsp_1xx, hrsp_2xx, hrsp_3xx, hrsp_4xx, hrsp_5xx, hrsp_other, hrsp_total for backends and frontends
  • qtime, ctime, rtime, ttime metrics for backend servers
  • backend servers in UP state

@ktarasz

uptime

netdata now uses /proc/uptime when CLOCK_BOOTTIME does not report the same uptime. In containers CLOCK_BOOTTIME reports the uptime of the host, while /proc/uptime reports the uptime of the container, so now netdata correctly reports the uptime of the container.
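The selection rule described above can be sketched as follows. This is an illustration, not netdata's C implementation: the function names and the 1 second tolerance used to decide that the two sources "do not report the same uptime" are assumptions.

```python
import time


def choose_uptime(boottime_s, proc_uptime_s, tolerance_s=1.0):
    """Prefer /proc/uptime when it disagrees with CLOCK_BOOTTIME."""
    if abs(boottime_s - proc_uptime_s) > tolerance_s:
        # Inside a container the two differ: /proc/uptime is the container's.
        return proc_uptime_s
    return boottime_s


def read_uptime():
    """Read both sources on Linux and apply the rule above."""
    boottime_s = time.clock_gettime(time.CLOCK_BOOTTIME)
    with open("/proc/uptime") as f:
        proc_uptime_s = float(f.read().split()[0])
    return choose_uptime(boottime_s, proc_uptime_s)


# A container started 1 hour ago on a host that has been up for 10 days.
print(choose_uptime(864000.0, 3600.0))  # 3600.0
# On a plain host both sources agree, so CLOCK_BOOTTIME is kept.
print(choose_uptime(3600.2, 3600.0))    # 3600.2
```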

mdstat

various fixes to better monitor rebuild time and rate @l2isbad

KSM

  • removed to_scan dimension
  • the savings % reported by netdata was less than the actual - fixed it.

elasticsearch

Added several charts for translog / indices segments statistics and JVM buffer pool utilization, which are often helpful when evaluating an elasticsearch node health #3544 @NeonSludge

memory monitoring

  • treat slab memory as cached #3288 @amichelic
  • added a new chart for monitoring the memory available for use, before hitting swap
  • netdata now monitors Linux hugepages and transparent hugepages
  • added hugepages monitoring #3462

diskspace monitoring

  • support huge numbers of mount points #3258 - netdata was crashing with a stack overflow due to recursion - now it is a loop, so any number of mount points is supported

network monitoring

  • moved tcp passive and active opens to a separate chart, to allow the TCP issues dimensions scale better by default #3238
  • updated the information presented on TCP charts to match the latest v4.15 kernel source #3239

APC UPS

netdata now supports monitoring multiple APC UPSes.

ISC DHCPd

netdata now also supports monitoring IPv6 leases - @l2isbad

fronius

  • added a new dimension solar_consumption @ccremer
  • added alarms @ccremer

stiebeleltron

  • added alarms @ccremer

web_log

Added web server response timings histogram #3558 @Wing924.

python.d.plugin

  • python.d.plugin can now start even if /etc/netdata/python.d.conf is missing @l2isbad
  • python.d.plugin now has an internal run counter @l2isbad
  • the unicode decoding of the plugin has been fixed (#3406) @l2isbad
  • the plugin now does not validate self-signed certificates @l2isbad
  • the plugin can not revive obsolete charts @l2isbad

charts.d.plugin

charts.d.plugin BASH modules can now have custom number of retries in case of data collection failures #3524.

web server

  • netdata now has a new internal web server that supports a fixed number of threads - we call it the static web server. This web server allows netdata to work around memory fragmentation (since the threads are fixed, the underlying memory allocators reuse the same memory arenas) and to control cpu utilization (we can set the number of threads that will be used by netdata). This is the default now. #3248
  • now the static threads web server reports the CPU usage of each of its threads.
  • the HTTP response headers now include the netdata version

dashboard

  • the print button now respects the URL path netdata is hosted at.

  • dygraphs updated to the latest version - this fixes an issue that prevented netdata charts from being interactive under certain conditions

  • added dygraph theme logscale #3283

  • fontawesome updated to version 5

  • d3 updated to the latest version (this broke c3 charts that require an older version)

  • added d3pie charts

  • custom dashboards can now have alarms for specific roles (all, none, one or more).

  • allow stacked charts to zoom vertically when dimensions are selected

  • netdata now has a global XSS protection #3363

  • netdata now uses intersectionObserver when available #3280 - this improves the scrolling performance of the dashboard.

  • prevent date, time and units from wrapping at the charts legends #3286

  • various units scaling improvements #3285

  • added data-common-colors="NAME" chart option for custom dashboards #3282.

  • added wiki page for creating custom dashboards on Atlassian's Confluence.

  • prevented a double click on the charts' toolbox to select the text of the buttons.

  • fixed the alignment of dashboard icons #3224 @xPaw

  • added a simple js, called refresh-badges.js, to update badges on a custom web page

badges

netdata badges can now be scaled #3474

API

  • added gtime parameter, for group time. It asks netdata to return values at a different rate (e.g. gtime=60 on an X/sec dimension will return X/min).
  • fixed a rounding bug in JSON generation #3309
  • the dimensions= parameter now supports simple patterns #3170 and added option values match-ids and match-names to control which matches are executed for dimensions.
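The effect of gtime can be shown with a small sketch. The helper names data_url and rescale are hypothetical, and the host, port and chart used in the example are assumptions; only the gtime parameter itself comes from the notes above.

```python
from urllib.parse import urlencode


def data_url(base, chart, after=-600, gtime=None):
    """Build a data query URL, optionally asking for a different rate."""
    params = {"chart": chart, "after": after, "format": "json"}
    if gtime is not None:
        params["gtime"] = gtime
    return f"{base}/api/v1/data?{urlencode(params)}"


def rescale(per_second, gtime):
    """What gtime does conceptually: X/sec becomes X per gtime seconds."""
    return per_second * gtime


# Ask for per-minute values of a chart that is collected per second.
print(data_url("http://localhost:19999", "system.net", gtime=60))
# 250 units/sec correspond to 15000 units/min.
print(rescale(250, 60))  # 15000
```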

alarms

  • system.swap alarms now send notifications with a 30 seconds delay, to work-around a kernel bug that incorrectly reports all swap as instantly used under containers #3380.

  • added alarm to predict the time a mount point will run out of inodes #3566.

  • all system alarms are now ported to FreeBSD too #3337 @arch273

  • added alerta.io notifications @kattunga

  • added available memory alarm

  • removed unsupported html tags from hipchat notifications.

  • pagerduty notifications have been modified to avoid incident duplication #3549.

  • alarm definitions can now use both chart IDs and chart names (prior to this version only chart IDs were allowed).

  • curl options (eg for disabling SSL certificates verification) for alarm-notify.sh can now be defined in health_alarm_notify.conf.

  • netdata can now send notifications to IRC channels #3458 @manosf

backends

  • on netdata masters, allow filtering the hosts that will be sent to backends with send hosts matching = * pattern.
  • improved connection error handling and added retries to allow netdata to connect to certain backends that failed with EALREADY or EINPROGRESS.
  • json backends now receive host tags (the tags have to be formatted in a json friendly way) #3556.
  • re-worked the alarm that triggers when backend data are lost, to avoid flip-flops.
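The host filter above takes a pattern. As a sketch of how such pattern lists are commonly evaluated, the code below assumes the following semantics, which are not spelled out in these notes: patterns are separated by spaces, * matches any run of characters, a leading ! makes a pattern negative, and the first pattern that matches decides the result. The function name simple_pattern_match is hypothetical.

```python
import re


def simple_pattern_match(pattern, name):
    """Return True if name is accepted by a space-separated pattern list."""
    for term in pattern.split():
        negative = term.startswith("!")
        if negative:
            term = term[1:]
        # Turn each * into "match anything" and keep the rest literal.
        regex = ".*".join(re.escape(part) for part in term.split("*"))
        if re.fullmatch(regex, name):
            # The first matching term wins, positive or negative.
            return not negative
    # Nothing matched: the name is not accepted.
    return False


print(simple_pattern_match("*", "web1"))       # True
print(simple_pattern_match("!db* *", "db1"))   # False
print(simple_pattern_match("!db* *", "web1"))  # True
```

Under these assumed rules, a value such as !db* * would send every host except those whose names start with db.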

prometheus backends

  • added URL option timestamps=yes|no to /api/v1/allmetrics to support prometheus Pushgateway #3533
  • added netdata_info variable with the version of netdata
  • renamed netdata_host_tags to netdata_host_tags_info (the old exists but is deprecated and will be removed eventually)
  • when prometheus uses average metrics, netdata remembers the last time prometheus collected metrics, on a per-host basis.

metrics streaming between netdata

  • netdata masters and proxies now expose the version of the netdata collecting the metrics, not their own. So, now a netdata master shows on the dashboard and sends to backends the version of the netdata collecting the metrics #3538.
  • added stream.conf option multiple connections = accept | deny to allow or deny multiple connections for the same netdata host. The default remains accept, but it is likely to be changed to deny in future versions.

packaging

  • added docker hub builds for aarch64/arm64 @justin8
  • updated debian containers to use stretch @justin8
  • added FreeBSD init file
  • various installers fixes and improvements (make sure netdata is started, do not give information about features not supported on each operating system, allow non-root installations without errors, etc.)
  • various installer fixes for FreeBSD and MacOS
  • netdata-updater was growing the PATH variable on each of its runs - fixed it.
  • added --accept and --dont-start-it command line options to kickstart-static64.sh
  • netdata can be compiled without long double support (useful in embedded devices that don't support long double numbers) #3354
  • fixed netdata.spec to allow building netdata on older and newer rpm based distros. Also added a script to build a netdata rpm
  • static netdata installer now tries to find the location of the SSL ca-certificates on a system and properly configures the static curl provided with this path.
  • the netdata updater starts netdata only if it was running
  • added alpine dockerfile

other

  • added global option gap when lost iterations to control the number of iterations that should be lost to show a gap on the charts.
  • various fixes/improvements related to netdata logs - the main change is that now netdata logs the thread name that logged the message, providing helpful insights about the thread that complained.
  • re-worked the exit procedure of netdata to allow it to clean up properly - sometimes netdata was deadlocked during exit, waiting forever - now netdata always exits promptly #3184
  • fixed compilation on ancient gcc versions
  • netdata was always setting itself to the idle process scheduling priority, even when it was configured to do otherwise. Fixed it #3523
netdata -

Published by firehol-automation almost 7 years ago

New to netdata? Check its demo: https://my-netdata.io

User Base Monitored Servers Sessions Served

New Users Today New Machines Today Sessions Today


Overview of netdata v1.9

  1. snapshots
    We can now save and load dashboard snapshots for any timeframe in any resolution. Snapshots allow us to save artifacts, evidence, documentation of incidents, or just the raw data for postmortem analysis.

  2. highlighted time-frame
    We can now highlight a selected time-frame on all dashboard charts. So, to quickly compare charts press ALT or CONTROL and select an area on one chart. The same area will be highlighted on all charts.

  3. export to PDF
    We can now export netdata dashboards to PDF, for any timeframe with any detail.

  4. access lists (IP filtering)
    We can now set up IP filtering in netdata.conf for all functions of netdata (dashboard access, streaming, registry, badges, etc - no more iptables rules for protecting netdata).

  5. TCP overflows and connection drops
    netdata can now detect TCP listening sockets overflows and connection drops, for any server running on the host (even the ones netdata is not aware of).

  6. libvirt VMs
    netdata now detects libvirt network interfaces and moves them to VM section of the dashboard (it also supports .libvirt-qemu naming of cgroups).

  7. Units auto-scaling
    netdata dashboards can now scale units (KB -> MB -> GB -> TB, etc), on the fly.

  8. Units conversions
    netdata dashboards can now convert units (eg. Celsius to Fahrenheit, seconds to DDd:HH:MM:SS, etc), on the fly.

  9. Multiple Timezones
    netdata dashboards can now change timezone on the fly (yes, we can now compare charts with server logs).

  10. python.d.plugin rewritten
    @l2isbad rewrote the whole of it, to add flexibility and support the latest netdata features! The new plugin supports the old python modules.

  11. better / faster dashboard scrolling
    netdata now uses passive event listeners to detect page scrolling. This significantly improved the responsiveness of the dashboard (check your dashboard settings: sync scrolling is the fastest, async is closer to the older behavior).

  12. netdata now monitors couchdb, powerdns, beanstalkd and dnsdist !

  13. netdata now detects redis background save failures

  14. netdata can now send flock.com and kavenegar.com alarm notifications

and as always... dozens more improvements, enhancements, new features and bug fixes!


netdata dashboard snapshots !

Netdata can now export and import dashboard snapshots.

Snapshots are JSON files containing everything the dashboard needs to be rendered: charts and chart data.

They are exported as JSON files, to your computer. The saved snapshots can be loaded back on any netdata dashboard (even of a different host). When importing, no network traffic is generated. The web browser loads the local file and renders an interactive dashboard to examine it.

The current visible timeframe of the dashboard is respected, so first align the dashboard to the timeframe required and then click "Export". The pop-up allows selecting the resolution of the export (its detail).

peek 2017-11-13 13-13


highlighted time-frame !

Press the ALT or CONTROL key and select a time-frame at a chart. An overlay will appear with the selected time-frame and all the charts will highlight the same region.

The highlighted time-frame:

  1. Is added to the URL hash, so that reloading the page keeps it
  2. Is propagated to other netdata servers, via the my-netdata menu
  3. Is saved in dashboard snapshots (and of course restored when they are loaded back)

peek 2017-11-19 19-39

Also, netdata charts can now be zoomed vertically (use the SHIFT key, like in zoom, but select the chart vertically):

peek 2017-11-19 20-10


netdata dashboards to PDF !

netdata dashboards can now be printed to PDF. Just click the 🖨️ icon on the dashboard.

The current visible timeframe of the dashboard is respected, so first align the dashboard to the timeframe required and then click "Print".

peek 2017-11-11 19-55


netdata now supports API access lists (IP filtering)

netdata can now check the client IPs connecting to it and deny/allow access based on your settings. No more iptables rules to control access to netdata.

All these settings are netdata simple patterns that are checked against the client IP (string matching - not subnet matching). localhost clients (IPv4, IPv6 and unix domain sockets) can be matched with localhost:

Global access control

  • netdata.conf: [web].allow connections from to match the clients' IPs allowed to connect to netdata. This has the same effect as iptables (but implemented at the application level - so clients will get connected, and disconnected immediately if they are not allowed access, without any response from netdata).

Dashboard access control

  • netdata.conf: [web].allow dashboard from to match the clients' IPs that are allowed to access the dashboard (ie fetch static files and query netdata API).
  • netdata.conf: [web].allow badges from to match the clients' IPs that are allowed to access badges (the dashboard clients are allowed to access badges too, so this setting allows badges to clients that do not have access to the dashboard).

Streaming access control

  • netdata.conf: [web].allow streaming from to match the clients' IPs that are allowed to stream metrics.
  • stream.conf: [API_KEY].allow from to match the clients' IPs allowed to push metrics for the given API KEY.
  • stream.conf: [MACHINE_GUID].allow from to match the clients' IPs allowed to push metrics for the specific machine.

netdata will also check the API keys supplied by slaves and proxies connected.

Other access lists

  • netdata.conf: [web].allow netdata.conf from to limit the clients that can get netdata.conf - by default netdata allows only private IPs.
  • netdata.conf: [registry].allow from to limit the clients allowed to access the registry (only when this netdata acts as a registry).
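
To make the "string matching, not subnet matching" point concrete, here is a minimal sketch of how such an access list could be evaluated. It approximates netdata simple patterns with Python's fnmatch; the left-to-right, first-match-wins order and the leading ! for deny rules are assumptions of this sketch, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase


def ip_allowed(client_ip, patterns):
    """Check client_ip against a space-separated list of patterns.

    Assumed semantics: patterns are tried left to right, the first match
    decides, and a leading '!' turns a pattern into a deny rule.
    """
    for pattern in patterns.split():
        negative = pattern.startswith("!")
        if negative:
            pattern = pattern[1:]
        # Plain string matching: '*' matches any run of characters.
        if fnmatchcase(client_ip, pattern):
            return not negative
    return False
```

Because the match is on the text of the address, a pattern such as 10* would also match 100.1.2.3, so patterns like 10.* are the safer way to describe a private range.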

netdata detects TCP listening sockets overflowing or dropping connections

Added a new chart: ipv4.tcplistenissues with dimensions ListenOverflows and ListenDrops.

This chart detects whether any listening TCP socket on the host overflows or drops connections. This is system-wide: any listening TCP socket, of any application.

The chart will not be shown if these kernel counters are zero. It will be enabled automatically if it is found non-zero at any point (it is collected via /proc/net/netstat every second). If you need to enable it even if it is zero, edit netdata.conf and set:

[plugin:proc:/proc/net/netstat]
	TCP listen issues = yes
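
For readers who want to look at the same counters by hand, the following sketch parses text in the /proc/net/netstat layout (a line of names followed by a line of values for each prefix) and picks out the two dimensions of the chart. The sample text is a shortened, made-up excerpt, and the function is only an illustration, not netdata's collector.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat style text into {prefix: {name: value}}."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    stats = {}
    # The file comes in pairs: a header line of names, then a line of values.
    for names, values in zip(lines[0::2], lines[1::2]):
        prefix = names[0].rstrip(":")
        stats[prefix] = dict(zip(names[1:], map(int, values[1:])))
    return stats


# Made-up sample; a real file has many more columns.
sample = (
    "TcpExt: SyncookiesSent ListenOverflows ListenDrops\n"
    "TcpExt: 0 3 5\n"
    "IpExt: InOctets OutOctets\n"
    "IpExt: 1000 2000\n"
)
tcp = parse_netstat(sample)["TcpExt"]
listen_issues = {name: tcp[name] for name in ("ListenOverflows", "ListenDrops")}
```

On a Linux host the same function can be fed the contents of /proc/net/netstat; the counters are cumulative, so a rate needs two readings.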

Two alarms have been added, one for ListenOverflows and one for ListenDrops that detect if there is any overflow or drop in the last minute (they run every 10 seconds).

slack alarm for overflows:

image

slack alarm for drops:

image

and the alarms configuration:

screenshot from 2017-10-09 23-04-05

The alarms will automatically be attached when the chart is active.

The overflows dimension and alarm are supported on FreeBSD too.

/proc/net/sockstat and /proc/net/sockstat6

These files provide sockets statistics for all protocols.

screenshot from 2017-11-07 02-39-37

netdata also adds 3 new alarms:

  1. too many tcp orphan sockets
  2. tcp memory that detects that the tcp stack is under memory pressure or close to giving memory errors
  3. too many tcp connections (for kernels that do not support dynamic allocation of connections)

Streaming

  • netdata proxies with more than 100 slaves, had a timing issue that caused them to crash randomly on slave reconnects. Parts of the code have been rewritten to get rid of the timing issue.

  • netdata slaves and proxies, now have a protection that ensures they will never use 100% CPU, even if the master is misbehaving.

  • expired orphaned hosts are now removed from the my-netdata menu of the dashboard.

  • streaming functions can now be monitored via access.log

  • streaming now supports IP filtering. So the entire streaming functionality, API keys and MACHINE GUIDs can be associated with one or more IPs or IP patterns.

  • streaming now transfers alarm variables too


python.d.plugin rewritten

@l2isbad did a marvelous job rewriting python.d.plugin. The new plugin:

  1. supports option autodetection_retry: SECONDS. When set to non-zero, the plugin re-checks the module every SECONDS seconds. This solves the problem that netdata never started collecting metrics from an application that was not running when netdata started. The default is zero for all modules, so you need to enable it for each application that needs it.

  2. got a rewrite of several functions, like logging, module configuration, chart and dimensions management.

  3. the new URL service disables certificate checks by default, to allow self-signed certificates to work without configuration.

The new plugin is compatible with custom python modules developed for the previous version.
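
The behaviour of autodetection_retry can be pictured with a small sketch. This is not the plugin's code: the function name, the max_attempts bound and the injectable sleep are hypothetical, added only so the idea can be shown and exercised in isolation.

```python
import time


def autodetect(check, autodetection_retry=0, max_attempts=None, sleep=time.sleep):
    """Call check() until it succeeds; return the attempt count, or 0 on give-up.

    autodetection_retry == 0 means "try once and give up", mirroring the
    default described above. max_attempts exists only to bound the example.
    """
    attempts = 0
    while True:
        attempts += 1
        if check():
            return attempts
        if autodetection_retry <= 0 or attempts == max_attempts:
            return 0
        # Wait the configured number of seconds before re-checking.
        sleep(autodetection_retry)
```

With a non-zero retry value, a module whose application starts late is picked up on a later attempt instead of being dropped for good.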


web_log plugin

  • custom regex now supports parsing hostnames and IPs @l2isbad

  • web_log now parses lines with error 408 (request timeout - these are a special case, since the request has not been received by the web server, so the log line is incomplete) @l2isbad

  • now properly parses resp_length with value "-" @racciari


couchdb monitoring

CouchDB maintainer @wohali, submitted a couchdb plugin for netdata. The plugin monitors:

  • database activity
  • http response codes
  • server operations
  • per DB statistics

mwsnap 2017-09-29 22_54_33
mwsnap 2017-09-29 22_54_44


redis monitoring

2 charts have been added to monitor background save health status, bundled with 2 alarms that detect if background save has failed, or background save is slow (warn > 10 mins, crit > 20min). @l2isbad

screenshot_20170925_092235


Other new and enhanced plugins

  • netdata now monitors PowerDNS, @l2isbad

  • netdata now monitors beanstalkd, @l2isbad

  • netdata now monitors dnsdist, @nobody-nobody

  • disks under Linux are renamed using /dev/disk/by-label. An option has been added at netdata.conf to also allow renaming based on /dev/disk/by-id.

  • chrony is now disabled by default, because there have been reports that chronyc enters an infinite loop in CentOS and RHEL.

  • tomcat improvements to support flavors of the tomcat server @Wing924

  • zfs on FreeBSD now monitors ZFS TRIM statistics

  • disks monitoring charts on FreeBSD got a lot more FreeBSD related dimensions.

  • added CPU frequency charts on FreeBSD (Linux already had them).

  • chart system.io (the total system Disk I/O) is now calculated by aggregating the reads and writes of all physical disks. The previous system.io chart (that is based on pgpgin and pgpgout from /proc/vmstat) is now named system.pgpgio. The key difference is that the new system.io now sees ZFS I/O, and it also correctly and accurately sums the real disk bandwidth of RAID arrays.

  • chart system.net (the total system network bandwidth) is now calculated by aggregating the bandwidth of all physical network interfaces and is common for both IPv4 and IPv6.

  • tc (QoS) charts now sort the dimensions on the legends, the same way tc reports them.

  • postgres: in versions before 10 the WAL directory was named pg_xlog, and from 10 upwards it has been renamed to pg_wal @facetoe

  • mysql (and mariadb) got new charts for galera replication @spinitron

  • openvpn_log improvements @l2isbad

  • smartd improvements @l2isbad

  • varnish module has been rewritten @l2isbad

  • mdstat regex fix @l2isbad

  • smartd_log improvements @l2isbad

  • dns_query_time improvements @wungad

  • isc_dhcpd improvements @wungad

  • freeipmi.plugin got a command line option (can be given at netdata.conf) to ignore certain sensor IDs that are faulty.

  • freeradius improvements @wungad

  • node.d.plugin bugfixes

Plugins protocol enhancements

  • netdata now supports multiple plugin directories. The setting is the same in netdata.conf, plugins directory = "DIRECTORY1" "DIRECTORY2" ..., up to 20 directories. By default netdata sets:
[global]
      plugins directory = "/usr/libexec/netdata/plugins.d" "/etc/netdata/custom-plugins.d"
  • netdata now supports alarm variables.

    Each plugin can now define host global and chart local variables with static values, that can be used in alarms' expressions. So, hosts and charts can now have any number of static values associated with them (eg. an application server may expose its max connections limit), and these static values can be used to trigger alarms (eg. the current connections, is compared to the max connections variable). The whole setup allows alarm templates to use this feature (eg each netdata can maintain different such variables for each server it monitors).

    Alarm variables are propagated to upstream netdata servers.


O/S - distro support

  • added init file for SLC 6.9 and CloudLinux Server release 6.9

  • packages installer was incorrectly detecting all python versions as version 2.

  • a makeself bug that prevented the static netdata binaries from being installed on busybox systems, has been fixed.

  • openrc startup script (gentoo, alpine) had hardcoded the path to netdata. This affected all static-64bit builds when installed on these distros. Fixed.

  • the static 64bit installer now downloads netdata.conf, much like the git installer does.

  • openrc / gentoo init improvements @candrews

  • enabled support for macOS versions 10.5+ (10.11 was working already) @vlvkobal

  • enabled support for FreeBSD 12 @vlvkobal

  • fixed a crash on macOS hosts with empty disk names.

  • added Dockerfile.armv7hf for running netdata under docker on ARM v7 machines @justin8


Dashboard improvements

  • hover selection of charts is now faster on all browsers. Perfect on Chrome, Firefox and Opera. Quite usable on Edge.

  • the dashboard is now fixed when a modal is open, preventing scrolling the page.

  • the dashboard now uses fontawesome 5.0.1 for icons.

  • the chart names can now be searched with browser control-F (find in page). Because netdata lazy loads all charts, it was previously impossible to search for a chart. Now the charts are searchable. This is important on dashboards with several hundreds of statsd charts, because all these charts appear under the same section.

  • netdata now detects libvirt VM network interfaces and moves them to the VM section of the dashboard. The same functionality already exists for containers.

    screenshot from 2017-10-31 01-32-43

  • Show the context of each chart. The context is used in alarm templates. (hover on the date of the chart)

    image

  • Show the resolution of the chart. (hover on the time of the chart)

    image

  • The dashboard now adds a tooltip at the date of the charts, to show the plugin and its module that collects each chart.

  • The dashboard should now put a lot less CPU pressure on the browser when the page does not have focus.

automatic units scaling

The dashboard does dynamic units scaling, on the fly ! It converts:

  • network bandwidth (kilobits/s to megabits/s or gigabits/s)
  • input/output bandwidth (kilobytes/s to megabytes/s or gigabytes/s, similarly for KB/s)
  • memory sizes (MB to KB, GB or TB)
  • disk sizes (GB to MB or TB)

Chart units dynamically adapt based on the value of the selected dimension too:

peek 2017-10-06 22-58

Custom dashboards can give data-desired-units="UNITS" and netdata will automatically convert the presented values to the desired units. UNITS can be any of the supported ones, or auto for auto-scaling based on the values, or original to show the original units maintained by the netdata server.
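
The scaling idea can be sketched in a few lines. The example below only scales upwards through a list of bandwidth units and assumes a step of 1000 between them; the function name and the exact thresholds are assumptions, not the dashboard's implementation.

```python
def auto_scale(value, units, step=1000.0):
    """Scale value up through units while it is at least `step`."""
    scaled = float(value)
    index = 0
    while abs(scaled) >= step and index < len(units) - 1:
        scaled /= step
        index += 1
    return scaled, units[index]


BANDWIDTH_UNITS = ["kilobits/s", "megabits/s", "gigabits/s"]
```

For example, auto_scale(2500, BANDWIDTH_UNITS) gives 2.5 megabits/s, while values below the step stay in the original unit.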

units conversions

The dashboard now supports units conversions. Currently it converts:

temperatures from Celsius to Fahrenheit

image

seconds to human readable duration DDd:HH:MM:SS

image
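
Both conversions are simple arithmetic, shown here as a sketch. The function names are illustrative, negative durations are not handled, and the dashboard's exact formatting (for instance whether it hides zero days) is not specified above, so the output format is an assumption.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def seconds_to_duration(total_seconds):
    """Format a non-negative number of seconds as DDd:HH:MM:SS."""
    total = int(total_seconds)
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}d:{hours:02d}:{minutes:02d}:{seconds:02d}"
```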

timezone conversions

netdata can now convert all dates presented to any timezone. Traditionally netdata presented all charts at the timezone of the viewer. This allowed homogeneous central administration of systems that are installed all over the world. However, this was inefficient when we needed to compare the information presented on the dashboard, with the log files of the servers.

So, now netdata can present the charts on any timezone. The netdata server auto-detects the timezone of the server and new dashboard settings have been added to allow this conversion.

If autodetection of the server's timezone fails, the configuration option [global].timezone has been added in netdata.conf to set it. Also, the dashboard itself allows the viewers to configure the timezone (it is saved at browser local storage, so this has to be set just once per viewer).

new dashboard options

To support all the above, the dashboard settings got a new tab, with all the required options:

screenshot from 2017-10-10 23-54-01


statsd improvements

  • statsd metrics can now be added to statsd synthetic charts using patterns. No need to add a dimension line for each statsd metric to be added. netdata will also extract the wildcarded part of the metric name and use that one for the dimension name.

  • dimensions added to statsd synthetic charts, can automatically be renamed using a dictionary. Each synthetic charts application has its own dictionary of name - value pairs, which is used to automatically rename statsd metrics when they are added to synthetic charts.

  • statsd timers and histograms now report zeros when nothing is collected
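
The first two items can be illustrated together: take the part of the metric name matched by the wildcard, then look it up in a dictionary to get the dimension name. The sketch assumes a pattern with exactly one *, and assumes the dictionary is keyed by the wildcarded part; the metric names and function names are hypothetical.

```python
def wildcard_part(metric, pattern):
    """Return the text matched by the single '*' in pattern, or None."""
    prefix, _, suffix = pattern.partition("*")
    fits = len(metric) >= len(prefix) + len(suffix)
    if fits and metric.startswith(prefix) and metric.endswith(suffix):
        return metric[len(prefix):len(metric) - len(suffix)]
    return None


def dimension_name(metric, pattern, dictionary):
    """Name a dimension after the wildcarded part, renamed via dictionary."""
    part = wildcard_part(metric, pattern)
    if part is None:
        return None
    # Fall back to the wildcarded part when the dictionary has no entry.
    return dictionary.get(part, part)
```

With the pattern myapp.api.* and a dictionary entry get = GET requests, the metric myapp.api.get becomes the dimension "GET requests", and myapp.api.put keeps the name "put".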


Badges improvements

  • fixed a bug in netdata badges that was incorrectly matching zero values with the null color condition.

  • added API option display_absolute to allow badges to use the signed value for color evaluation, but present the absolute value.


Other Alarm and Alarm Notifications Improvements

  • warning emails sent by netdata are now a little bit more orange (they were a bit greenish).

  • added flock.com notifications @tvarsis

  • added kavenegar.com support for SMS notifications @vahit

  • fixed a bug in email notifications that was triggering a corrupted MIME match by anti-spam solutions.

  • pushbullet notifications now track the devices, so that per device filtering at pushbullet is possible. Also improved the formatting a bit. @user501254

  • pushover notifications fixes (the priority of warnings was set incorrectly)

  • alarms can now use variables like this ${variable with spaces or +, -, *, / in it}. So, alarms can now use dimension names with any character in them.


Other Improvements

  • access.log has been refactored to support monitoring all netdata operations

  • inodes monitoring is now by default disabled for mount points based on filesystems that do not have a maximum inode threshold (such as cephfs).

  • rabbitmq has been added to apps_groups.conf so that apps.plugin now monitors (cpu, memory, disk I/O, sockets, etc) for rabbitmq instances.

  • several email and log management apps have been added to email and logs targets of apps_groups.conf, @Flums

  • ceph target added to apps_groups.conf to allow netdata to monitor Ceph - the unified, distributed storage system, @k0ste

  • refactored several internal data collection plugins to eliminate a few hundreds of index lookups per second.

  • netdata.conf settings that were loaded from disk, but were the same as the default ones, were generated commented when the server was asked to give its config. Now all loaded settings are generated uncommented.

  • netdata simple patterns can now extract the wildcarded part of the string they match (used in statsd synthetic charts)

  • netdata simple patterns can allow escaping spaces by prefixing them with a backslash.

netdata -

Published by firehol-automation about 7 years ago


netdata v1.8.0 released.

This release focuses on metrics streaming improvements and containers monitoring.

As always, this netdata is the fastest and the most stable netdata ever! Update now!

To install or update netdata, click here!

key streaming improvements

bug fix: streaming slaves consuming 100% CPU

netdata, as a slave, was not handling all the error cases properly, resulting in 100% cpu utilization of a single core, under certain conditions. Especially under FreeBSD and macOS slaves, these conditions were always met, so using FreeBSD or macOS as netdata slaves, was completely broken.

bug fix: missing alarm notifications on netdata masters

netdata was incorrectly mixing up cached alarm state data between the alarms of the mirrored hosts, resulting in alarm notifications not being dispatched under certain conditions. This was affecting only netdata masters (ie. netdata servers with more than one host database, with health monitoring enabled). The alarms were generated and were visible at the dashboards, but the notifications were not always sent.

bug fix: streamed charts with duplicate names

There was a minor issue with charts that were created with name aliases. When these charts were streamed from netdata slaves to netdata masters, they ended up with duplicate chart names (ie instead of type.name they had type.type.name).


key containers monitoring improvements

  • Container network interfaces are now moved to the container section and they are rendered from the container view point (i.e. sent = what the container sent) - no more veth* garbage on the dashboard.

  • The interfaces also appear as eth0 (or whatever the container sees) and they are inside the container section of the dashboard. netdata maps each veth* interface to the right container, using plain cgroups features, so this works for all container managers (docker, lxc, etc).

  • Eliminated the nested containers shown under certain versions of lxc.

  • Also, containers and VMs now have summary gauges on the dashboard

    image


key plugins improvements

python.d.plugin now supports HTTP keep-alive

netdata now uses urllib3 (shipped with netdata for both python v2 and v3) for URLService based plugins.

This enables HTTP keep-alive on all connections, which allows netdata to have permanent connections to third party web applications.

Fixed by @l2isbad


compatibility enhancements

  • better support for Oracle Linux, by @schindlerd
  • better support for Alpine Linux
  • various fixes at the build procedure for macOS
  • fping can now run as non-root, in static binary netdata packages

netdata generic enhancements

  • netdata can now listen on UNIX domain sockets (.sock files). This allows a local web server and netdata to communicate bypassing the network stack (for netdata set bind to = unix:/path/to/netdata.sock - this option supports multiple arguments, so netdata can listen to multiple unix sockets and tcp sockets, at the same time).

  • netdata was assuming that the JSON representation of a chart would at most be 1024 bytes, and it was generating corrupted JSON output when any chart was exceeding that limit. Removed the limitation (ie. now there is no limit).

  • netdata was crashing while starting, if no usable disks were found.

  • systemd netdata.service now allows setting negative netdata OOM score and restarts netdata if it crashes. The new netdata.service is not automatically installed when updating netdata. Either delete /etc/systemd/system/netdata.service and then update/re-install netdata, or copy the file by hand.

  • minor fixes at the installer, by @vincele
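
As an illustration of the UNIX domain socket option, the sketch below sends a bare HTTP request over a .sock file and returns the raw response. The socket path is whatever was configured with bind to = unix:...; the URL path and the use of HTTP/1.0 are assumptions chosen to keep the example short, and in practice a local web server would usually do this proxying.

```python
import socket


def http_get_unix(socket_path, url_path="/api/v1/allmetrics", timeout=5.0):
    """Send a minimal HTTP/1.0 GET over a UNIX domain socket; return raw bytes."""
    request = f"GET {url_path} HTTP/1.0\r\nHost: localhost\r\n\r\n".encode("ascii")
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.settimeout(timeout)
        client.connect(socket_path)
        client.sendall(request)
        chunks = []
        # HTTP/1.0 without keep-alive: read until the server closes the socket.
        while True:
            chunk = client.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

The returned bytes include the status line and headers, so a caller would still need to split off the body.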


new plugins

  • Added Intel CPU temperature charts on FreeBSD and macOS, by @vlvkobal
  • Added CPU thermal throttling charts on Linux (useful on physical servers and possibly laptops)
  • Added chrony plugin, by @domschl
  • Added Stiebel Eltron plugin to collect metrics from heat pumps and hot water installations from Stiebel Eltron ISG @BrainDoctor

improved plugins

  • web_log bugfixes, enhancements and optimizations (including squid logs), by @l2isbad
  • web_log now enables parsing HTTP/2 logs in custom_log_format, by @Funzinator
  • redis bugfixes, by @l2isbad
  • haproxy bugfixes, by @l2isbad
  • elasticsearch bugfixes and optimizations, by @l2isbad
  • rabbitmq bugfixes and optimizations, by @l2isbad
  • mdstat bugfixes, by @JeffHenson
  • tomcat improvements, by @Wing924
  • mysql improvements, by @alibo and @l2isbad
  • dovecot improvements
  • postgres improvements, by @facetoe
  • cpufreq: fixed a bug that prevented accurate reporting of CPU frequencies. Accurate reporting works with the acpi-cpufreq driver and calculates the average CPU clock of the CPUs using the per-frequency accounting reported by the kernel, by @tycho
  • cpuidle performance improvements (faster under load) by @tycho
  • fail2ban bugfixes, by @l2isbad
  • SNMP plugin now uses the latest net-snmp, and the corrupted 64 bit counters encountered under certain node.js versions are now fixed.

dashboard improvements

  • easypiecharts and gauges can now render arbitrary ranges and animate clock wise or counter clock wise.

  • traditionally netdata was using 1024 bits = 1 kilobit. It is fixed: 1000 bits = 1 kilobit.

  • netdata charts should now work on wordpress pages.


alarms and notifications

  • alarm-notify.sh now supports debug mode, showing the exact commands it runs to send notifications, when NETDATA_ALARM_NOTIFY_DEBUG=1 is exported

  • alarm-notify.sh now supports setting the sender email address of the emails it sends.

  • emails sent by alarm-notify.sh now include headers to reduce the possibility of them being scored as spam, by @Ferroin

  • network related alarms got new thresholds and improved badges

  • netdata now detects if the system has been suspended and pauses all alarms for 60 seconds on resume, to prevent false alarms (no more false alarms on laptops when they resume).

  • netdata alarms now support filtering based on hostname and O/S (linux, freebsd, macos). This means that netdata masters, can now support alarms for slaves of any O/S (i.e. a Linux netdata master can handle alarms for a FreeBSD slave).

  • netdata slack notifications now show the host that sent the alarm. In the image below, the alarm is about bangalore, and is sent by netdata-build-server (at the lower left corner):

    image


statsd

  • the number of fractional points supported by statsd is now configurable (1 to 7).
  • 95th percentile calculation on statsd histograms and timers, was incorrectly averaging the values. It is now fixed.
  • statsd metrics with non ASCII text were processed by the statsd server, but were breaking JSON data generated by netdata. Fixed it by replacing all invalid characters.
netdata - v1.7.0

Published by philwhineray over 7 years ago


This is release v1.7 of netdata.

netdata is still spreading fast: we are at 320.000 users and 132.000 servers! Almost 100k new users, 52k new installations and 800k docker pulls since the previous release 4 and a half months ago! netdata user base grows at about 1000 new users and 600 new servers per day! Thank you! You are awesome!

The next release (v1.8) will be focused on providing a global health monitoring service, for all netdata users, for free! Read more about it here. We need supporters for this cause. Join us!

highlights of netdata v1.7

  1. netdata is now a (very fast) fully featured statsd server and the only one with automatic visualization: push a statsd metric and hit F5 on the netdata dashboard: your metric visualized. It also supports synthetic charts, defined by you, so that you can correlate and visualize your application the way you like it.

  2. netdata got new installation options - it is now easier than ever to install netdata - we also distribute a statically linked netdata x86_64 binary, including key dependencies (like bash, curl, etc) that can run everywhere a Linux kernel runs (CoreOS, CirrOS, etc).

  3. metrics streaming and replication has been improved significantly. All known issues have been solved and key enhancements have been added. headless collectors and proxies can now send metrics to backends when data source = as collected.

  4. backends have got quite a few enhancements, including host tags, metrics filtering at the netdata side and sending of chart and dimension names instead of IDs; prometheus support has been re-written to utilize more prometheus features and provide more flexibility and integration options. IF YOU UPDATE FROM NETDATA 1.6 PLEASE CHECK YOUR DASHBOARDS, SINCE MANY METRICS HAVE CHANGED NAMES.

  5. netdata now monitors ZFS (on Linux and FreeBSD), ElasticSearch, RabbitMQ, Go applications (via expvar), ipfw (on FreeBSD 11), samba, squid logs (with web_log plugin!).

  6. netdata dashboard loading times have been improved significantly (hit F5 a few times on a netdata dashboard - it is now amazingly fast), to support dashboards with thousands of charts.

  7. netdata alarms now support custom hooks, so you can run whatever you like in parallel with netdata alarms.

  8. As usual, this release brings dozens more improvements, enhancements and compatibility fixes.

netdata is now a fully featured statsd server

netdata is now a fully featured statsd server. It can collect statsd formatted metrics, visualize them on its dashboards, stream them to other netdata servers or archive them to backend time-series databases.

netdata statsd is fast. It can collect more than 1.200.000 metrics per second on modern hardware, more than 200Mbps of sustained statsd traffic. netdata statsd is inside netdata. This provides a distributed statsd implementation.

netdata also supports statsd synthetic charts: You can create dedicated sections on the dashboard to render the charts. You can control everything: the main menu, the submenus, the charts, the dimensions on each chart, etc.

Read more about netdata statsd

counters

  • Scope: count the events of something (e.g. number of file downloads)
  • Format: name:INTEGER|c or name:INTEGER|C or name|c
  • statsd increments the counter by the INTEGER number supplied (positive, or negative).

image

gauges

  • Scope: report the value of something (e.g. cache memory used by the application server)
  • Format: name:FLOAT|g
  • statsd remembers the last value supplied, and can increment or decrement the latest value if FLOAT begins with + or -.

image

histograms

  • Scope: statistics on a size of events (e.g. statistics on the sizes of files downloaded)
  • Format: name:FLOAT|h
  • statsd maintains a list of all the values supplied and provides statistics on them.

image

The same chart with sum unselected, to show the detail of the dimensions supported:
image

meters

This is identical to counter.

  • Scope: count the events of something (e.g. number of file downloads)
  • Format: name:INTEGER|m or name|m or just name
  • statsd increments the counter by the INTEGER number supplied (positive, or negative).

image

sets

  • Scope: count the unique occurrences of something (e.g. unique filenames downloaded, or unique users that downloaded files)
  • Format: name:TEXT|s
  • statsd maintains a unique index of all values supplied, and reports the unique entries in it.

image

timers

  • Scope: statistics on the duration of events (e.g. statistics for the duration of file downloads)
  • Format: name:FLOAT|ms
  • statsd maintains a list of all the values supplied and provides statistics on them.

image

The same chart with the sum unselected:
image
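
The metric formats listed above can be sketched in a few lines of Python. This is only an illustration of the name:VALUE|TYPE wire format, not netdata code: the function names are hypothetical, and the host and port (localhost, UDP 8125) are assumptions that should be checked against your own statsd configuration.

```python
import socket

# Type suffixes taken from the formats listed above.
STATSD_TYPES = {
    "counter": "c",
    "gauge": "g",
    "histogram": "h",
    "meter": "m",
    "set": "s",
    "timer": "ms",
}


def format_metric(name, value, kind):
    """Build one statsd line in the name:VALUE|TYPE format."""
    return f"{name}:{value}|{STATSD_TYPES[kind]}"


def send_metric(line, host="localhost", port=8125):
    """Send one statsd line over UDP.

    Assumption: a statsd server is listening on host:port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
```

For example, send_metric(format_metric("downloads", 1, "counter")) would send the line downloads:1|c, and passing the string "+5" as a gauge value produces a relative update such as cache.used:+5|g.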


dashboard improvements

There have been significant optimizations to the loading times of the dashboard. The dashboard now loads instantly, even when there are several hundred charts in it (hit F5 on the dashboard - it is super fast).

For those who know: we eliminated most browser reflows, by refactoring the way the charts are initialized and splitting initialization in 2 phases. Unfortunately we had to re-shape gauge and easypiecharts, so pay some attention to your custom dashboards after updating.

We now use natural sorting on the dashboard elements (i.e. instead of 1, 10, 2, 3 we get 1, 2, 3, 10).
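
Natural sorting can be illustrated with a short Python sketch. This is not the dashboard's implementation (the dashboard is JavaScript); it only shows the idea of comparing digit runs as numbers, and the function name is hypothetical.

```python
import re


def natural_key(text):
    """Split text into alternating non-digit and digit runs.

    re.split with a capturing group puts the digit runs at odd indexes,
    so those are compared as integers and the rest as lowercase text.
    """
    parts = re.split(r"(\d+)", text)
    return [int(part) if index % 2 else part.lower()
            for index, part in enumerate(parts)]


names = ["1", "10", "2", "3"]
plain = sorted(names)                     # string order: 1, 10, 2, 3
natural = sorted(names, key=natural_key)  # natural order: 1, 2, 3, 10
```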

There have been dozens of performance improvements on the netdata dashboard. Like all the previous releases, this release makes netdata the fastest netdata so far!

new installation methods

  • Single line installation on Linux
  • Static 64bit packages for Linux
  • Improved support for Red Hat Enterprise Linux @racciari
  • Improved support for Amazon Machine Image
  • Improved support for CentOS @n0coast
  • Many more installer/updater improvements @nielsAD, @mfurlend

Streaming

  • improved self cleanup of obsolete charts and hosts at a central netdata.
  • host tags are now propagated from netdata to netdata while streaming metrics.
  • log error when multiple clients are streaming the metrics of the same host.
  • dozens more streaming improvements and bugfixes.

Backends

  • New prometheus backend, supporting all the features of the other backends netdata supports. The new format changed the names of metrics, so if you use grafana or other tools you will have to update your queries.
  • Prometheus and opentsdb now support host tags (advanced ephemeral nodes monitoring)
  • Metrics sent to backends with data source average, sum or volume (from the netdata database) are now more accurate.
  • Added contrib/nc-backend.sh, a script that can act as a fallback backend for graphite, opentsdb and compatibles.
  • netdata nodes without a database (slaves and proxies) can now send as collected metrics to backends.

New and improved plugins

  • Go apps monitoring via expvar ! @kralewitz
  • ElasticSearch monitoring ! @l2isbad
  • RabbitMQ monitoring ! @l2isbad
  • ipfw monitoring under FreeBSD 11 ! @vlvkobal
  • ZFS monitoring under FreeBSD (@vlvkobal) and Linux !
  • samba monitoring ! @ntlug
  • web_log plugin can now monitor squid logs too ! @l2isbad
  • web_log plugin can now monitor apache cache logs too (removed old apache_cache plugin) @l2isbad
  • many more web_log improvements - web_log is now a lot more powerful! @l2isbad
  • python.d.plugin LogService now supports monitoring web log files matching a pattern @l2isbad
  • disk monitoring under Linux now utilizes /dev/mapper names. It also has improved docker compatibility.
  • haproxy improvements @l2isbad
  • dns_query_time plugin to monitor the response time of nameservers @l2isbad
  • Fronius Solar @BrainDoctor
  • better support for monitoring Proxmox/qemu @efaden and libvirt/qemu VMs
  • cpufreq improvements @l2isbad
  • smartd_log improvements @pkoenig10
  • bind_rndc rewritten @l2isbad
  • lighttpd improvements (part of the apache plugin)
  • isc_dhcpd improvements @l2isbad
  • fping improvements
  • apps.plugin improvements (added many more applications to monitor, notably hadoop and friends, improved compatibility)
  • freeipmi improvements
  • mdstat improvements @l2isbad
  • mysql improvements @alibo
  • redis improvements @l2isbad
  • postgres rds fixes @facetoe
  • fail2ban improvements @l2isbad
  • idlejitter rewritten
  • openvpn improvements @l2isbad
  • numa improvements @Benje06

New and improved alarms

  • alarm-notify.sh now supports custom notification methods (you can hook whatever you like to netdata alarms).
  • email notifications are now multipart (have both HTML and text versions in them)
  • low memory alarm now excludes ZFS ARC.
  • improved discord notifications.
  • improved telegram notifications @alibo
  • lighttpd alarm
  • mongodb alarm @jnogol

Other improvements

  • memory mode ram utilizes KSM (kernel memory deduper).
  • many memory mode map improvements for faster operation with huge databases.
  • netdata is now even faster on FreeBSD, thanks to several optimizations made by @vlvkobal
  • netdata can now be compiled with clang, even on FreeBSD
  • netdata can now be compiled on FreeBSD 10.3

netdata - v1.6.0

Published by philwhineray over 7 years ago

New to netdata? Check its demo: https://my-netdata.io

Release announced on twitter, hacker news, reddit r/linux, reddit r/sysadmin, reddit r/linuxadmin, reddit r/freebsd, reddit r/devops, reddit r/homelab, facebook

birthday release: 1 year netdata

netdata was first published on March 30th, 2016.
It has been a crazy year since then.

Central netdata is here!

This is the first release that supports real-time streaming of metrics between netdata servers.

netdata can now be:

  • autonomous host monitoring (like it always has been)
  • headless data collector (collect and stream metrics in real-time to another netdata)
  • headless proxy (collect metrics from multiple netdata and stream them to another netdata)
  • store and forward proxy (like headless proxy, but with a local database)
  • central database (metrics from multiple hosts are aggregated)

Metrics databases can be configured on all nodes. Each node that maintains a database may have its own retention policy and may run its own (possibly different) alarms on the metrics it stores.

There are 4 settings that control what netdata can be:

  1. [global].memory mode in netdata.conf, controls if a netdata will maintain a local database and the type of it. For more information check Running a dedicated central netdata server.

  2. [web].mode in netdata.conf, controls if netdata will expose its API, and the type of web server to enable (single or multi-threaded). Check netdata.conf configuration for streaming.

  3. [stream].enabled in stream.conf, controls if netdata will stream its metrics to another netdata. Check stream.conf for sending metrics.

  4. [API KEY].enabled in stream.conf, controls if netdata will accept metrics from other netdata. Check stream.conf for receiving metrics.

Using the above, we support a lot of different configurations, like these:

| target | memory mode | web mode | stream enabled | send to backend | local alarms | local dashboard |
| --- | --- | --- | --- | --- | --- | --- |
| headless collector | none | none | yes | not possible | not possible | no |
| headless proxy | none | not none | yes | not possible | not possible | no |
| proxy with db | not none | not none | yes | possible | possible | yes |
| central netdata | not none | not none | no | possible | possible | yes |
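
As a rough sketch, the configurations listed above can be expressed as a small lookup on three of the settings. The function name is hypothetical, the values "ram" and "multi-threaded" in the usage below are just illustrative non-none settings, and combinations that do not appear in the list are reported as such rather than guessed.

```python
def netdata_role(memory_mode, web_mode, stream_enabled):
    """Map three settings to the role names used in the list above."""
    has_db = memory_mode != "none"   # [global].memory mode
    has_web = web_mode != "none"     # [web].mode
    if stream_enabled:               # [stream].enabled
        if not has_db and not has_web:
            return "headless collector"
        if not has_db and has_web:
            return "headless proxy"
        if has_db and has_web:
            return "proxy with db"
    elif has_db and has_web:
        return "central netdata"
    return "not listed"


role = netdata_role("none", "none", True)
```

Here role is "headless collector", while netdata_role("ram", "multi-threaded", False) gives "central netdata".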

monitoring ephemeral nodes

netdata now supports monitoring autoscaled ephemeral nodes, that are started and stopped on demand (their IP is not known).

When the ephemeral nodes start streaming metrics to the central netdata, the central netdata will register them in the my-netdata menu on the dashboard.

You can see this live at https://build.my-netdata.io (this server may not always be available for demo).

For more information check: monitoring ephemeral nodes.

monitoring ephemeral containers and VM guests

netdata now cleans up container, guest VM, network interface and mounted disk metrics, automatically disabling their alarms too.

For more information check monitoring ephemeral containers.

apps.plugin ported for FreeBSD

Vladimir Kobal has ported apps.plugin to FreeBSD.

netdata can now provide Applications, Users and User Groups under FreeBSD too:

Also, the CPU utilization of netdata under FreeBSD is now a lot lower than in netdata v1.5.

See it live at our FreeBSD demo server.

web_log plugin

Ilya Mashchenko has done a wonderful job creating a unified web log parsing plugin for all kinds of web server logs. With it, netdata provides real-time performance information and health monitoring alarms for web applications and web sites!

Requests by http status:
image

Requests by http status code family:
image

Requests by http status code:
image

Requests bandwidth:
image

Requests timings:
image

URL patterns of interest (you configure the patterns):
image

Requests by http method:
image

Requests by IP version:
image

Number of unique clients:
image

and a lot more, including alarms:

| alarm | description | minimum requests | warning | critical |
| --- | --- | --- | --- | --- |
| 1m_redirects | The ratio of HTTP redirects (3xx except 304) over all the requests, during the last minute. Detects if the site or the web API is suffering from too many or circular redirects. (i.e. oops! this should not redirect clients to itself) | 120/min | > 20% | > 30% |
| 1m_bad_requests | The ratio of HTTP bad requests (4xx) over all the requests, during the last minute. Detects if the site or the web API is receiving too many bad requests, including 404, not found. (i.e. oops! a few files were not uploaded) | 120/min | > 30% | > 50% |
| 1m_internal_errors | The ratio of HTTP internal server errors (5xx), over all the requests, during the last minute. Detects if the site is facing difficulties to serve requests. (i.e. oops! this release crashes too much) | 120/min | > 2% | > 5% |
| 5m_requests_ratio | The percentage of successful web requests of the last 5 minutes, compared with the previous 5 minutes. Detects if the site or the web API is suddenly getting too many or too few requests. (i.e. too many = oops! we are under attack) (i.e. too few = oops! call the network guys) | 120/5min | > double or < half | > 4x or < 1/4x |
| web_slow | The average time to respond to requests, over the last 1 minute, compared to the average of last 10 minutes. Detects if the site or the web API is suddenly a lot slower. (i.e. oops! the database is slow again) | 120/min | > 2x | > 4x |
| 1m_successful | The ratio of successful HTTP responses (1xx, 2xx, 304) over all the requests, during the last minute. Detects if the site or the web API is performing within limits. (i.e. oops! help us God!) | 120/min | < 85% | < 75% |
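
To make the thresholds above concrete, here is a minimal Python sketch of how a ratio alarm such as 1m_internal_errors (warning > 2%, critical > 5%, at least 120 requests per minute) could be evaluated. It is an illustration only, not netdata's health engine: the function name is hypothetical, and returning UNDEFINED below the minimum number of requests is an assumption.

```python
def ratio_alarm_status(matching, total, warning_pct, critical_pct,
                       min_requests=120):
    """Evaluate a 'ratio over all requests' alarm for one minute of traffic."""
    if total < min_requests:
        # Assumption: too few requests means the ratio is not evaluated.
        return "UNDEFINED"
    ratio = 100.0 * matching / total
    if ratio > critical_pct:
        return "CRITICAL"
    if ratio > warning_pct:
        return "WARNING"
    return "CLEAR"


# 7 internal errors out of 200 requests is 3.5%: above 2%, below 5%.
status = ratio_alarm_status(7, 200, warning_pct=2, critical_pct=5)
```

The 1m_successful alarm works in the opposite direction (it triggers when the ratio drops below its thresholds), so it would need the comparisons reversed.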

For more information check: the spectacles of a web server log file.

backends

netdata can now archive metrics to JSON backends (both push, by @lfdominguez, and pull modes).

IPMI monitoring

netdata now has an IPMI plugin (based on freeipmi) for monitoring server hardware.

The plugin creates (up to) 8 charts, based on the information collected from IPMI:

  1. number of sensors by state
  2. number of events in SEL
  3. Temperatures CELSIUS
  4. Temperatures FAHRENHEIT
  5. Voltages
  6. Currents
  7. Power
  8. Fans

It also supports alarms (including the number of sensors in critical state):

image

For more information, check monitoring IPMI.

New Plugins

Ilya Mashchenko builds python data collection plugins for netdata at a wonderful rate! He rocks!

  • web_log for monitoring in real-time all kinds of web server log files @l2isbad
  • freeipmi for monitoring IPMI (server hardware)
  • nsd (the name server daemon) @383c57
  • mongodb @l2isbad
  • smartd_log (monitoring disk S.M.A.R.T. values) @l2isbad

Improved Plugins

  • nfacct reworked and now collects connection tracker information using netlink.
  • ElasticSearch re-worked @l2isbad
  • mysql re-worked to allow faster development of custom mysql based plugins (MySQLService) @l2isbad
  • SNMP
  • tomcat @NMcCloud
  • ap (monitoring hostapd access points)
  • php_fpm @l2isbad
  • postgres @l2isbad
  • isc_dhcpd @l2isbad
  • bind_rndc @l2isbad
  • numa
  • apps.plugin improvements and freebsd support @vlvkobal
  • fail2ban @l2isbad
  • freeradius @l2isbad
  • nut (monitoring UPSes)
  • tc (Linux QoS) now works on qdiscs instead of classes for the same result (a lot faster) @t-h-e
  • varnish @l2isbad

New and Improved Alarms

  • web_log, many alarms to detect common web site/API issues
  • fping, alarms to detect packet loss, disconnects and unusually high latency
  • cpu, cpu utilization alarm now ignores nice

New and improved alarm notification methods

  • HipChat to allow hosted HipChat @frei-style
  • discordapp @lowfive

Dashboard Improvements

  • dashboard now works on HiDPI screens
  • dashboard now shows version of netdata
  • dashboard now resets charts properly
  • dashboard updated to use latest gauge.js release

Other Improvements

  • thanks to @rlefevre netdata now uses a lot of different high resolution system clocks.

netdata has received a lot more improvements from many more contributors! (it was really a lot of work to dig into git log to collect all the above, so forgive me if I forgot to mention a few contributions and contributors).

Thank you all!

netdata - v1.5.0

Published by ktsaou over 7 years ago

New to netdata? Check its demo: http://my-netdata.io

Release announced on twitter, hacker news, reddit r/linux, reddit r/sysadmin, reddit r/linuxadmin, reddit r/freebsd

Yet another release that makes netdata the fastest netdata ever!

This is probably the release with the largest changeset so far. A lot of work, by a lot of people, made this release possible!

FreeBSD, MacOS and FreeNAS

Vladimir Kobal has done magnificent work porting netdata to FreeBSD and MacOS.

Everything works:

  • cpu and interrupts, memory, disks (performance and space monitoring)
  • network interfaces and softnet
  • IPv4 and IPv6 metrics
  • processes and context switches
  • IPC (queues, semaphores, shared memory)
  • and of course all the netdata external plugins

Wow! Check it live on FreeBSD, at https://freebsd.my-netdata.io/

Backends

netdata supports data archiving to backend databases:

  • Graphite
  • OpenTSDB
  • Prometheus

and of course all the compatible ones (KairosDB, InfluxDB, Blueflood, etc)

image

With this feature netdata can interface with your existing devops infrastructure and allow you to visualize its metrics with other tools, like grafana.

New Plugins

Ilya Mashchenko has created most of the python data collection plugins in this release! He rocks!

  • Systemd Services (real-time monitoring of the resource utilization of all systemd services, using cgroups!)
  • FPing (network latency and jitter monitoring with netdata!)
  • Postgres databases @facetoe, @moumoul
  • Varnish disk cache (v3 and v4) @l2isbad
  • ElasticSearch @l2isbad
  • HAproxy @l2isbad
  • FreeRadius @l2isbad, @lgz
  • mdstat (RAID) @l2isbad
  • ISC bind (via rndc) @l2isbad
  • ISC dhcpd @l2isbad, @lgz
  • Fail2Ban @l2isbad
  • OpenVPN status log @l2isbad, @lgz
  • NUMA memory @tycho
  • CPU Idle States @tycho
  • gunicorn @deltaskelta
  • ECC memory hardware errors
  • IPC semaphores
  • uptime (with a nice badge too)

Improved Plugins

  • netfilter conntrack
  • MySQL/MariaDB (replication) @l2isbad
  • ipfs @pjz
  • cpufreq @tycho
  • hddtemp @l2isbad
  • sensors @l2isbad
  • nginx @leolovenet
  • nginx_log @paulfantom
  • phpfpm @leolovenet
  • redis @leolovenet
  • dovecot @justohall
  • cgroups
  • disk space
  • apps.plugin
  • /proc/interrupts @rlefevre
  • /proc/softirqs @rlefevre
  • /proc/vmstat (system memory charts)
  • /proc/net/snmp6 (IPv6 charts)
  • /proc/self/meminfo (system memory charts)
  • /proc/net/dev (network interfaces)
  • tc (linux QoS)

New and Improved Alarms

  • MySQL/MariaDB alarms (incl. replication)
  • IPFS alarms
  • HAproxy alarms
  • UDP buffer alarms
  • TCP AttemptFails
  • ECC memory alarms
  • netfilter connections alarms

New Alarm Notification Methods

  • messagebird.com @tech-no-logical
  • pagerduty.com @jimcooley
  • pushbullet.com @tperalta82
  • twilio.com @shadycuz
  • HipChat
  • kafka

Shell Integration

Shell scripts can now query netdata easily!

eval "$(curl -s 'http://localhost:19999/api/v1/allmetrics')"

After this command, all the netdata metrics are exposed to the shell. Check:

# source the metrics
eval "$(curl -s 'http://localhost:19999/api/v1/allmetrics')"

# let's see if there are variables exposed by netdata for system.cpu
set | grep "^NETDATA_SYSTEM_CPU"

NETDATA_SYSTEM_CPU_GUEST=0
NETDATA_SYSTEM_CPU_GUEST_NICE=0
NETDATA_SYSTEM_CPU_IDLE=95
NETDATA_SYSTEM_CPU_IOWAIT=0
NETDATA_SYSTEM_CPU_IRQ=0
NETDATA_SYSTEM_CPU_NICE=0
NETDATA_SYSTEM_CPU_SOFTIRQ=0
NETDATA_SYSTEM_CPU_STEAL=0
NETDATA_SYSTEM_CPU_SYSTEM=1
NETDATA_SYSTEM_CPU_USER=4
NETDATA_SYSTEM_CPU_VISIBLETOTAL=5

# let's see the total cpu utilization of the system
echo ${NETDATA_SYSTEM_CPU_VISIBLETOTAL}
5

# what about alarms?
set | grep "^NETDATA_ALARM_SYSTEM_SWAP_"
NETDATA_ALARM_SYSTEM_SWAP_RAM_IN_SWAP_STATUS=CRITICAL
NETDATA_ALARM_SYSTEM_SWAP_RAM_IN_SWAP_VALUE=53
NETDATA_ALARM_SYSTEM_SWAP_USED_SWAP_STATUS=CLEAR
NETDATA_ALARM_SYSTEM_SWAP_USED_SWAP_VALUE=51

# let's get the current status of the alarm 'ram in swap'
echo ${NETDATA_ALARM_SYSTEM_SWAP_RAM_IN_SWAP_STATUS}
CRITICAL

# is it fast?
time curl -s 'http://localhost:19999/api/v1/allmetrics' >/dev/null

real  0m0,070s
user  0m0,000s
sys   0m0,007s

# it is...
# 0.07 seconds for curl to be loaded, connect to netdata and fetch the response back...

The _VISIBLETOTAL variable sums up all the dimensions of each chart.

The format of the variables is:

NETDATA_${chart_id^^}_${dimension_id^^}="${value}"

The value is rounded to the closest integer, since shell scripts cannot process decimal numbers.
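
The naming rule above can be sketched in Python as follows. The function name is hypothetical, and replacing non-alphanumeric characters with underscores is an assumption inferred from the example output (chart system.cpu appears as SYSTEM_CPU); netdata's exact sanitization may differ.

```python
import re


def shell_variable(chart_id, dimension_id, value):
    """Build a NETDATA_<CHART>_<DIMENSION>="<value>" assignment line."""
    def clean(identifier):
        # Assumption: anything that is not a letter or digit becomes "_".
        return re.sub(r"[^A-Za-z0-9]", "_", identifier).upper()

    # round() gives the closest integer (exact .5 ties go to the even one).
    return f'NETDATA_{clean(chart_id)}_{clean(dimension_id)}="{round(value)}"'


line = shell_variable("system.cpu", "user", 4.3)
```

Here line holds NETDATA_SYSTEM_CPU_USER="4", which matches the shape of the variables shown in the shell session above.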

Dashboard Improvements

  • dashboard is now faster on firefox, safari, opera, edge (edge is still the slowest)
  • dashboard charts legends now have bigger fonts
  • SHIFT + mousewheel to zoom charts, works on all browsers
  • perfect-scrollbar on the dashboard
  • dashboard 4K resolution fixes
  • dashboard compatibility fixes for embedding charts in third party web sites
  • charts on custom dashboards can have common min/max even if they come from different netdata servers
  • alarm log is now saved and loaded back so that the alarm history is available at the dashboard

Other Improvements

  • python.d.plugin has received way too many improvements from many contributors!
  • charts.d.plugin can now be forked to support multiple independent instances
  • registry has been re-factored to lower its memory requirements (required for the public registry)
  • simple patterns in cgroups, disks and alarms
  • netdata-installer.sh can now correctly install netdata in containers
  • supplied logrotate script compatibility fixes
  • spec cleanup @breed808
  • clocks and timers reworked @rlefevre

netdata has received a lot more improvements from many more contributors! (it was really a lot of work to dig into git log to collect all the above, so forgive me if I forgot to mention a few contributions and contributors).

Thank you all!

netdata - v1.4.0

Published by ktsaou about 8 years ago

New to netdata? Check its demo: http://my-netdata.io

Release announced on Hacker News
Release announced on reddit r/linux
Release announced on reddit r/sysadmin
Release announced on twitter

At a glance

  • the fastest netdata ever (with a better look too)!
  • improved IoT and containers support!
  • alarms improved in almost every way!
  • new plugins:
    • softnet netdev,
    • extended TCP metrics,
    • UDPLite
    • NFS v2, v3 client (server was there already),
    • NFS v4 server & client,
    • APCUPSd,
    • RetroShare
  • improved plugins:
    • mysql,
    • cgroups,
    • hddtemp,
    • sensors,
    • phpfpm,
    • tc (QoS)

In detail

improved alarms!

Many new alarms have been added to detect common kernel configuration errors and old alarms have been re-worked to avoid notification floods.

Alarms now support:

  • notification hysteresis (both static and dynamic)

    image

  • notification self-cancellation, and

  • dynamic thresholds based on current alarm status

    image
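
The idea behind thresholds that depend on the current alarm status can be sketched as follows. This is a generic hysteresis example in Python, not netdata's alarm syntax: the function names and the 90/80 thresholds are hypothetical and chosen only for illustration.

```python
def next_status(current_status, value, raise_above=90.0, clear_below=80.0):
    """Return the new alarm status for one sample.

    Once raised, the alarm stays raised until the value drops below the
    lower threshold, which avoids flapping around a single limit.
    """
    if current_status == "WARNING":
        return "WARNING" if value >= clear_below else "CLEAR"
    return "WARNING" if value > raise_above else "CLEAR"


def trace(values, status="CLEAR"):
    """Apply next_status to a series of samples and collect each status."""
    history = []
    for value in values:
        status = next_status(status, value)
        history.append(status)
    return history


history = trace([85, 92, 88, 83, 79, 85])
```

With these samples the alarm is raised at 92, stays raised at 88 and 83 (still above the lower threshold), and clears only at 79. A single fixed threshold of 90 would have cleared it already at 88.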

Also, a new alarms log:

image

improved alarm notifications

netdata now supports:

  • email notifications
  • slack.com notifications on slack channels
  • pushover.net notifications (mobile push notifications)
  • telegram.org notifications

For all the above methods, netdata supports role-based notifications, with multiple recipients for each role and severity filtering per recipient!

Also, netdata supports HTML5 notifications, while the dashboard is open in a browser window (no need to be the active one).

image

All notifications (HTML5, emails, slack, pushover, telegram) are now clickable to get to the chart that raised the alarm.

other improvements

  • improved IoT support!

    netdata builds and runs with musl libc and runs on systems based on busybox.

  • improved containers support!

    netdata runs on alpine linux (a low profile linux distribution used in containers).

  • Dozens of other improvements and bugfixes


netdata 1.4.0 - download release tarfiles from http://firehol.org/download/netdata/releases/v1.4.0

netdata - v1.3.0

Published by ktsaou about 8 years ago

New to netdata? Check its demo: http://my-netdata.io

At a glance

  1. netdata has health monitoring / alarms!
  2. netdata generates badges that can be embedded anywhere!
  3. netdata plugins are now written in python!
  4. new plugins: redis, memcached, nginx_log, ipfs, apache_cache

IMPORTANT:
Since netdata now uses python plugins, new packages must be
installed on a system for it to work.
For more information, please check the installation page.

In detail

netdata has alarms!

Based on the POLL we made on github, health monitoring was the winner. So here it is!

netdata now has a powerful health monitoring system embedded.

image

netdata has badges!

netdata can generate badges with live information from the collected metrics.

netdata plugins are now written in python!

Thanks to the great work of Paweł Krupa (@paulfantom), most BASH plugins have been ported to python.

The new python.d.plugin supports both python2 and python3 and data collection from multiple sources for all modules.

The following pre-existing modules have been ported to python:

  • apache
  • cpufreq
  • example
  • exim
  • hddtemp
  • mysql
  • nginx
  • phpfpm
  • postfix
  • sensors
  • squid
  • tomcat

The following new modules have been added:

  • apache_cache
  • dovecot
  • ipfs
  • memcached
  • nginx_log
  • redis

other data collectors

Thanks to @simonnagl netdata now reports disk space usage.

other improvements

  • dashboards now transfer certain settings from server to server when changing servers via the my-netdata menu.

    The settings transferred are the dashboard theme, the online help status and current pan and zoom timeframe of the dashboard.

  • API improvements:

    • reduction functions now support 'min', 'sum' and 'incremental-sum'.
    • netdata now offers a multi-threaded and a single threaded web server (single threaded is better for IoT).
  • apps.plugin improvements:

    • can now run with command line argument 'without-files' to prevent it from enumerating all the open files/sockets/pipes of all running processes.
    • apps.plugin now scales the collected values to match the total system usage.
    • apps.plugin can now report guest CPU usage per process.
    • repeating errors are now logged once per process.
  • netdata now runs with IDLE process priority (lower than nice 19)

  • netdata now instructs the kernel to kill it first when it starves for memory.

  • netdata listens for signals:

    • SIGHUP to netdata instructs it to re-open its log files (new logrotate file added too).
    • SIGUSR1 to netdata saves the database
    • SIGUSR2 to netdata reloads health / alarms configuration
  • netdata can now bind to multiple IPs and ports.

  • netdata now has new systemd service file (it starts as user netdata and does not fork).

  • Dozens of other improvements and bugfixes
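
The signals listed above can be sent from a script. The following Python sketch (Unix only) maps each action to its signal; the action names and the function are hypothetical, and finding the netdata process ID is left to the caller.

```python
import os
import signal

# Actions and signals as described in the list above.
NETDATA_SIGNALS = {
    "reopen-logs": signal.SIGHUP,     # re-open log files
    "save-database": signal.SIGUSR1,  # save the database
    "reload-health": signal.SIGUSR2,  # reload health / alarms configuration
}


def signal_netdata(pid, action):
    """Send the signal for the given action to the process with this pid."""
    os.kill(pid, NETDATA_SIGNALS[action])
```

For example, signal_netdata(pid, "reload-health") would ask a running netdata to reload its alarms configuration, assuming pid is the ID of the netdata process and the caller is allowed to signal it.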

netdata 1.3.0 - download release tarfiles from http://firehol.org/download/netdata/releases/v1.3.0

netdata - v1.2.0

Published by ktsaou over 8 years ago

Netdata demo sites: http://my-netdata.io

At a glance

  1. netdata now is 30% faster !
  2. netdata now has a registry (my-netdata dashboard menu) !
  3. netdata now monitors Linux Containers (cgroups, docker, lxc, etc) !

IMPORTANT:
This version requires libuuid. The package you need to build netdata is:

  • uuid-dev (debian/ubuntu), or
  • libuuid-devel (centos/fedora/redhat)

In detail

netdata is now 30% faster !

  • Patches submitted by @fredericopissarra improved overall netdata performance by 10%.
  • A new improved search function in the internal indexes made all searches faster by 50%, resulting in about 20% better performance for the core of netdata.
  • More efficient threads locking in key components contributed to the overall speed up.

netdata now has a central registry !

The central registry tracks all your netdata servers and bookmarks them for you at the my-netdata menu on all dashboards.

Every netdata can act as a registry, but there is also a global registry provided for free for all netdata users!

netdata now monitors Linux Containers !

docker, lxc, or anything else. For each container it monitors CPU, RAM, DISK I/O (network interfaces were already monitored).

Other improvements

  • apps.plugin: now uses linux capabilities by default without setuid to root
  • netdata now has an improved signal handler thanks to @simonnagl
  • API: new improved CORS support
  • SNMP: counter64 support fixed
  • MYSQL: more charts, about QCache, MyISAM key cache, InnoDB buffer pools, open files
  • DISK charts now show mount point when available
  • Dashboard: improved support for older web browsers and mobile web browsers (thanks to @simonnagl)
  • Multi-server dashboards now allow de-coupled refreshes for each chart, so that if one netdata has a network latency the other charts are not affected
  • Dozens of other improvements, optimizations and bug-fixes.

netdata 1.2.0 - download release tarfiles also from http://firehol.org/download/netdata/releases/v1.2.0

netdata - v1.1.0

Published by ktsaou over 8 years ago

netdata 1.1.0 - download release tarfiles from http://firehol.org/download/netdata/releases/v1.1.0

Dozens of commits that improve netdata in several ways:

Data collection

  • added IPv6 monitoring
  • added SYNPROXY DDoS protection monitoring
  • apps.plugin: added charts for users and user groups
  • apps.plugin: grouping of processes now supports patterns
  • apps.plugin: it is now faster, even with the new features added
  • better auto-detection of partitions for disk monitoring
  • better fireqos integration for QoS monitoring
  • squid monitoring now uses squidclient
  • SNMP monitoring now supports 64bit counters

API

  • fixed issues in CSV output generation
  • netdata can now be restricted to listen on a specific IP (API and web server)

Core

  • added error log flood protection

Web Dashboard

  • better error handling when the netdata server is unreachable
  • each chart now has a toolbox
  • on-line help support
  • check for netdata updates button
  • added example /tv.html dashboard
  • now compiles with musl libc (alpine linux)

Packaging

  • added debian packaging
  • support non-root installations
  • the installer generates an uninstall script

netdata - v1.0.0

Published by ktsaou over 8 years ago

netdata 1.0.0 - download release tarfiles from http://firehol.org/download/netdata/releases/v1.0.0

netdata - netdata v1.0rc

Published by ktsaou over 8 years ago

netdata - Stable release v0.2

Published by ktsaou about 9 years ago