Posted 14 Dec 2021 in Linux

Reconsidering Netdata

Reconsidering where and how I install the open-source system monitoring tool, Netdata.

⚠️

This post was automatically migrated from my old blogging software, and I have not reviewed it for problems yet. Please contact me if you notice any important issues.

Since I learned about the open-source server monitoring tool Netdata in — I would guess — 2017, I’ve installed it on pretty much every Linux machine I maintain. That included not just web servers, but database & search servers, my NAS, and various Raspberry Pis^[1] here at home.

This has changed in recent months; this post explains how and why.

Netdata’s Installer/Updater Script

The recommended method of installing Netdata — and to the best of my recollection the only method years ago when I started using it — is their “automatic one-line installation script,” which is a long shell script that downloads the source for Netdata and its dependencies, builds everything locally, and installs it.^[2]

But the build process has recently (I noticed it several months ago) become much more resource-intensive. It requires a lot of CPU time, and on systems that put /tmp in RAM to avoid excessive disk wear, it requires a lot of RAM — hundreds of megabytes.^[3] (You can find a few discussions of this in Netdata’s GitHub Issues.)

And by default, Netdata updates itself automatically, meaning this build process happens fairly often, whenever Netdata releases an update.

Resource Consumption Problems

A couple months ago, I noticed that every once in a while, one of my ARM computers here at home (an Orange Pi PC, running Armbian) would drop off the network. The system was powered on and would reply to pings, but nothing else — I couldn’t SSH into it, and it wasn’t serving network requests as it normally should.

I went crazy and spent, cumulatively, several hours trying to track this down. I set up remote logging so I could review the system logs the next time it went down, but I didn’t find anything useful there. I disabled my “reboot on network loss” cron job^[4], to make sure it wasn’t that.

Was it a hardware failure? I reinstalled the OS on a fresh SD card, installed my application and other stuff — including Netdata — and waited. Sure enough, it went down after a few days, ruling out an SD card issue.

Next, I tried replacing the Orange Pi itself, with an identical spare. The Pi locked up and dropped my SSH connection while I was setting up, and this time I made a critical mental connection: it had died while installing Netdata.

What had happened was: Armbian has /tmp mounted to a RAM disk by default, and while building protobuf (for Java, of all things; why does Netdata need to build that‽) the Netdata installer script was writing a crazy amount of data to /tmp. This effectively killed the system.

I think this specific case has to do with the clever (overly clever?) way Armbian handles swap. To the best of my recollection, it allocates a zram-backed swap partition in memory. This effectively tiers the system's RAM into two parts: fast+uncompressed and slower+compressed. I think the Netdata updater killed this system by filling up /tmp, which filled up a lot of RAM, causing the system to "swap" that memory to the compressed half of RAM being used as "swap." Something in this process went awry, and the system became unresponsive. My theory is that this was due to one, or both, of these factors:

The system’s CPU is older and slower; it may have been overwhelmed handling the protobuf build process and compressing/decompressing the zram partition in memory.
Maybe the OOM killer doesn’t know how to deal with, or can’t deal with, tmpfs, zram, or both?
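To check whether a given machine has the same combination (a RAM-backed /tmp plus zram-backed swap), something like the following sketch could help. The function names mount_fstype and has_zram_swap are my own, hypothetical helpers; they only read the standard /proc/mounts and /proc/swaps tables and do not confirm or refute the theory above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Netdata or Armbian).

# Print the filesystem type of a mount point, given a table in
# /proc/mounts format (fields: device, mount point, fstype, options, ...).
# Usage: mount_fstype MOUNTPOINT [MOUNTS_FILE]
mount_fstype() {
    # Keep the last matching entry, since later mounts shadow earlier ones.
    awk -v mp="$1" '$2 == mp { fs = $3 } END { if (fs != "") print fs }' \
        "${2:-/proc/mounts}"
}

# Succeed if any active swap device is zram-backed, given a table in
# /proc/swaps format (one header line, then one device per line).
# Usage: has_zram_swap [SWAPS_FILE]
has_zram_swap() {
    awk 'NR > 1 && $1 ~ /^\/dev\/zram/ { found = 1 } END { exit !found }' \
        "${1:-/proc/swaps}"
}

# Example (commented out so that sourcing this file has no side effects):
# [ "$(mount_fstype /tmp)" = "tmpfs" ] && echo "/tmp lives in RAM"
# has_zram_swap && echo "swap is zram-backed"
```

If both checks succeed, anything written to /tmp competes with applications for the same physical memory, which is the situation described above.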

In any case, I can’t have servers going down randomly because Netdata pushed an update. (In addition to this computer, it’s taken down another local Raspberry Pi, and produced plenty of low-memory warnings on my web server.) That’s unacceptable, and I don’t understand how the Netdata organization could think this is a tenable state of affairs.

Clutter

I was dismayed recently to list / on one of my servers and find a hundred or so netdata-updater.log.xxxxx files. I have absolutely no idea why anyone would decide that writing a bunch of logs directly to / is a good idea.

This is easy enough to tidy up. As root:

Create a file at /etc/cron.daily/clean-netdata-update-logs (for example, with nano /etc/cron.daily/clean-netdata-update-logs), with the following contents:

#!/bin/sh
find "/" -maxdepth 1 -mtime +7 -name "netdata-updater.log.*" -delete

Then make it executable with chmod +x /etc/cron.daily/clean-netdata-update-logs. This will schedule a daily job to clean up all netdata-updater.log.xxxxx files older than one week.
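Before trusting a cron job that deletes files from /, it may be worth doing a dry run. The sketch below uses the same find expression with -print in place of -delete; the function name list_old_updater_logs is hypothetical, and the extra -type f test is my addition to restrict matches to regular files. Note that find counts age in whole days, so -mtime +7 matches files that are at least eight days old.

```shell
#!/bin/sh
# Hypothetical dry-run helper: list (without deleting) updater logs directly
# under DIR that were last modified more than DAYS whole days ago.
# Usage: list_old_updater_logs [DIR] [DAYS]
list_old_updater_logs() {
    find "${1:-/}" -maxdepth 1 -type f -mtime "+${2:-7}" \
        -name 'netdata-updater.log.*' -print
}

# Example (commented out so that sourcing this file has no side effects):
# list_old_updater_logs / 7
```

If the output lists only the files you expect, the -delete version in the cron job should remove exactly those.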

Other Issues

At various points, Netdata has pushed out updates that break the updater and/or Netdata itself, requiring manual intervention. And, currently, I’m getting daily emails from a couple servers due to a bug in the updater script.

Again: not acceptable.

Solutions

These issues are mainly with the installer script and not Netdata itself. Thankfully, there are now other ways to install Netdata. They have another script which installs a static binary, but I would prefer not to rely on buggy installer scripts. They also now have official deb and rpm packages hosted on PackageCloud.

So, I’ve adopted the following rough guidelines:

Consider whether a given host will actually benefit from Netdata monitoring. This is a cost/benefit calculation for me, since my earlier assumption that installing Netdata will do no harm turned out to be false. In many cases recently (for example, on Pis at home running trivial workloads) I’ve simply decided not to install Netdata (or to uninstall it; thankfully, Netdata does have a reliable uninstallation process).
Install Netdata using the pre-built packages from their PackageCloud repository, not the installer script.

If, for some reason, you are still considering using the installer script:

Absolutely do not use it on anything with a weak CPU or less than 2GB of memory. (My personal preference, depending on how the system is configured, would be 4GB.)
Implement the cleanup cronjob for /netdata-updater.log.* as described above.
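The memory guideline can be turned into a quick pre-flight check. This is a minimal sketch: mem_total_mib and enough_memory_for_build are hypothetical names, and the default threshold of 1800 MiB is my assumption, chosen because a board sold as "2GB" reports somewhat less than 2048 MiB in MemTotal.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers for the memory guideline above.

# Print total memory in MiB, read from a /proc/meminfo-style file
# (MemTotal is reported in kB).
# Usage: mem_total_mib [MEMINFO_FILE]
mem_total_mib() {
    awk '$1 == "MemTotal:" { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed only if total memory is at least MIN_MIB.
# The 1800 MiB default is an assumption, not a Netdata requirement.
# Usage: enough_memory_for_build [MIN_MIB] [MEMINFO_FILE]
enough_memory_for_build() {
    [ "$(mem_total_mib "${2:-/proc/meminfo}")" -ge "${1:-1800}" ]
}

# Example (commented out so that sourcing this file has no side effects):
# enough_memory_for_build || echo "Too little RAM; use the pre-built packages"
```

This only looks at installed memory; it says nothing about CPU speed or about how much of that memory is already in use.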

Installing from PackageCloud repository

To install Netdata from PackageCloud:

curl -s https://packagecloud.io/install/repositories/netdata/netdata/script.deb.sh | sudo bash
sudo apt install netdata

That process has worked for me on several systems, including a Raspberry Pi running Raspberry Pi OS Bullseye, but not on one running the older Pi OS Stretch. YMMV.
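Since the result seems to depend on the OS release, it can help to confirm which release a host is running before adding the repository. A minimal sketch, assuming the host has a standard /etc/os-release file; os_codename is a hypothetical helper name, and VERSION_CODENAME may be missing on older releases, in which case this prints "unknown".

```shell
#!/bin/sh
# Hypothetical helper: print the release codename from an os-release file.
# Usage: os_codename [OS_RELEASE_FILE]
os_codename() (
    # The parentheses run the body in a subshell, so the variables set by
    # sourcing the file do not leak into the calling shell.
    VERSION_CODENAME=
    . "${1:-/etc/os-release}"
    printf '%s\n' "${VERSION_CODENAME:-unknown}"
)

# Example (commented out so that sourcing this file has no side effects):
# os_codename
```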

Migrating to pre-built packages

I’ve now done this on a couple machines. These are my notes; they may not be complete, and they may not apply to your system. Take a backup of your Netdata configuration first.

As root:

mkdir -p ~/bak
cp -R /etc/netdata ~/bak/
tree -a ~/bak/netdata  # Confirm that your Netdata configuration has been copied here.
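If tree is not installed, or you would rather have a yes/no answer than compare listings by eye, a recursive diff is one alternative. This is a sketch; backup_matches is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the directory trees SRC and DST have
# identical contents, according to a recursive diff.
# Usage: backup_matches SRC DST
backup_matches() {
    diff -r "$1" "$2" > /dev/null
}

# Example (commented out so that sourcing this file has no side effects):
# backup_matches /etc/netdata ~/bak/netdata && echo "backup matches"
```

Dropping the redirect to /dev/null shows which files differ, if any.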

Refer to the uninstallation directions regarding the .environment file. Confirm that it exists at /etc/netdata/.environment; otherwise, replace /etc/netdata/.environment in the following step by its location on your system:

cat /etc/netdata/.environment

# Replace NETDATA_PREFIX in the next line by the prefix from the environment file (often it's an empty string)
# Replace /etc/netdata/.environment in the next line by the path to your environment file, verified above
${NETDATA_PREFIX}/usr/libexec/netdata/netdata-uninstaller.sh --yes --env /etc/netdata/.environment
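Rather than copying the prefix by hand, it can be read from the environment file. A minimal sketch, assuming the file consists of shell-style NAME="value" assignments (which is what mine looked like); netdata_prefix is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print NETDATA_PREFIX as recorded in a Netdata
# .environment file. Prints an empty line if the prefix is empty or unset.
# Usage: netdata_prefix [ENV_FILE]
netdata_prefix() (
    # The parentheses run the body in a subshell, so nothing the file sets
    # (such as PATH) leaks into the calling shell.
    NETDATA_PREFIX=
    . "${1:-/etc/netdata/.environment}"
    printf '%s\n' "$NETDATA_PREFIX"
)

# Example (commented out so that sourcing this file has no side effects):
# "$(netdata_prefix)/usr/libexec/netdata/netdata-uninstaller.sh" --yes --env /etc/netdata/.environment
```

Because this sources the file, only use it on an environment file you trust, which should be the case for one written by the installer on your own machine.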

Netdata is now uninstalled. Next, we’ll install it from PackageCloud:

curl -s https://packagecloud.io/install/repositories/netdata/netdata/script.deb.sh | bash
apt install netdata

systemctl status netdata  # Verify Netdata has been installed properly
systemctl stop netdata    # Stop Netdata while we configure it

Compare the old configuration to the new one, to see what all will be migrated; remove items that don't need to be migrated:

tree -a /etc/netdata
tree -a ~/bak/netdata
rm ~/bak/netdata/orig ~/bak/netdata/.environment ~/bak/netdata/.install-type ~/bak/netdata/.installer-cleanup-of-stock-configs-done

And finally, restore the configuration and start Netdata:

mv /etc/netdata ~/bak/netdata.new
mv ~/bak/netdata /etc
ls -l /etc/netdata/  # Verify the old configuration is in place

systemctl start netdata
systemctl status netdata

On one machine, after this migration, I ran into a permissions issue when trying to view the Netdata web console. A fix from this GitHub comment worked for me. Edit /etc/netdata/netdata.conf to include these lines, and restart Netdata (systemctl restart netdata):

web files owner = root
web files group = netdata

On hosts which use the fping plugin, I had to resolve another permissions issue:

chmod 644 /etc/netdata/fping.conf
systemctl restart netdata

Finally, if you use Netdata Cloud, this is effectively a new node. Remember to claim it.

Common Configuration Settings

To conclude this post, I’d like to share some settings I adjust on most Netdata installs. In /etc/netdata/netdata.conf:

[global]
	access log = none
	debug log = none
	enable web responses gzip compression = no
	disconnect idle web clients after seconds = 3600

[web]
	bind to = <private IP, or * on local network firewalled from Internet>
	disconnect idle clients after seconds = 3600
	enable gzip compression = no

bind to needs adjustment to allow access remotely; it should be configured according to your specific network & system setup, with security in mind.

The access log and debug log settings are intended to reduce Netdata’s resource consumption on the server. The settings related to gzip compression and disconnecting idle clients are applied when I'm proxying Netdata's web console through nginx, which handles compression, etc. For more on optimizing Netdata installations, you may refer to the Netdata documentation page on the subject.
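After restoring or editing a configuration, it can be handy to confirm what a given key is actually set to in a given section. The following is a rough sketch of a reader for the simple [section] / key = value layout shown above; conf_get is a hypothetical helper name, and it is not a full parser for everything netdata.conf can contain.

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY in [SECTION] of a
# netdata.conf-style file (prints nothing if the key is not set there).
# Usage: conf_get FILE SECTION KEY
conf_get() {
    awk -v section="[$2]" -v key="$3" '
        {
            line = $0
            gsub(/^[ \t]+|[ \t]+$/, "", line)   # trim surrounding whitespace
        }
        # A section header switches matching on or off.
        line ~ /^\[/ { in_section = (line == section); next }
        in_section {
            eq = index(line, "=")
            if (eq == 0) next                   # not a key = value line
            k = substr(line, 1, eq - 1)
            v = substr(line, eq + 1)
            gsub(/[ \t]+$/, "", k)
            gsub(/^[ \t]+/, "", v)
            if (k == key) print v
        }
    ' "$1"
}

# Example (commented out so that sourcing this file has no side effects):
# conf_get /etc/netdata/netdata.conf web "bind to"
```

Commented-out settings (lines starting with #) are ignored, because their key text does not match the requested key exactly.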

[1] It might seem odd to complain about how this tool works on a Raspberry Pi, but Netdata specifically encourages people to run it on Pis, so…
[2] The installation script will install some dependencies via the host’s package manager; others, like protobuf, it fetches and builds independently.
[3] In my experience, this seems to be largely related to Netdata’s relatively new usage of protobuf. Building that takes an insane amount of CPU time, and hundreds of megabytes of disk space/RAM, which is untenable on a Raspberry Pi.
[4] I run this script on most Pis here, particularly Pi Zero Ws, as I’ve frequently had problems with them dropping off the wireless network for no particular reason after days or weeks of uptime. I’ll blog about it eventually, in the Raspberry Pi Reliability series. This particular computer is connected via Ethernet, so putting this script on it was due to habit, not necessity.