One of my jobs at work is to help maintain our monitoring infrastructure. When I started, it was based purely on OpenNMS, but we found that it did not support what we needed for problem acknowledgement, outage scheduling, and various other routine tasks. I’ve supplemented it with Nagios, which I used extensively at a previous employer, and it’s working out well – it’s just quite a bit of work to maintain the config files, and there is a bit of a learning curve for people who haven’t configured it before. However, I’m always interested in looking at new solutions, and the opportunity to demo a few recently came my way. One of my coworkers recently attended LinuxWorld, and saw that there were quite a few monitoring companies. He passed my name to a couple of them, and asked them to contact me so we could take a look at their products. My employer is perfectly willing to pay hard cash for monitoring tools, if they have enough of an advantage over the open source products. Two of the companies have contacted me so far – IT Groundwork and Heroix. Both products are rather interesting. IT Groundwork is selling a suite of products based on open source solutions, backed by commercial support. Heroix is pushing their new “agentless” monitoring solution, which is called Longitude. I’ve gone through an online demonstration of each product, accompanied by my manager and one of my fellow systems administrators. Let me detail what I thought of each solution.
First, I’ll discuss my impressions of IT Groundwork’s product suite. They are using Nagios on the backend (which I like!), but have done a lot of work with it. They’ve added a database layer in the middle, where configuration and statistics information is stored (this is based on MySQL). They have also implemented a totally new GUI for Nagios, with a more streamlined status viewer (you can see screenshots here), a web-based configuration tool with lots of support for profiles and such, and various other nifty tools. They have also open sourced the viewer and config tool, which is very nice! With their product, it looks like you can fairly easily set up syslog-ng (and various other tools) to submit passive checks, which will let you do all of your notifications through one interface. List price for the solution is $16,000 for a single monitoring server, plus $5,000 for each remote server that passes checks back, or for a second server in an HA configuration. The price is an annual fee, and includes unlimited support and upgrades. However, if you do decide to cancel your subscription, you can keep the software.
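To make the passive-check idea a little more concrete, here is a minimal Python sketch of how an outside tool could hand a result to Nagios through its external command file, using the standard PROCESS_SERVICE_CHECK_RESULT command. The host name, service name, and helper function names are made up for illustration, and the location of the command file depends on the install, so it is passed in rather than hard-coded. This shows the plain Nagios mechanism, not necessarily how IT Groundwork wires syslog-ng into their suite.

```python
import time

VALID_STATES = (0, 1, 2, 3)  # OK, WARNING, CRITICAL, UNKNOWN


def passive_check_line(host, service, return_code, output, timestamp=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT line for the Nagios command file."""
    if return_code not in VALID_STATES:
        raise ValueError("return_code must be 0, 1, 2 or 3")
    if timestamp is None:
        timestamp = int(time.time())
    # Nagios reads one command per line, so flatten any newlines in the output.
    text = output.replace("\n", " ")
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{text}\n"


def submit(command_file, line):
    """Write the command line to the command file (normally a named pipe)."""
    with open(command_file, "w") as handle:
        handle.write(line)
```

A log watcher could then call something like `submit(path_to_command_file, passive_check_line("web01", "syslog", 2, "disk error seen"))`, where `web01` and `syslog` must match a host and a passive service already defined in the Nagios configuration.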
After reviewing IT Groundwork’s solution, we talked to Heroix about their Longitude platform. The thing that intrigued me the most about their product is the fact that it is “agentless” – it monitors your boxes with standard programs you’ve likely got installed already. For example, Windows is monitored with WMI calls, and UNIX boxes are monitored via SSH. The platform looked trivial to set up, and the reporting options appeared to be very nice (although the people running the demo did have a couple of issues with it. I asked them how to create a simple SLA report showing the responsiveness of a web server over the last 10 days, and they weren’t able to figure out how to do that on the demo server they had set up. Of course, the people doing the demonstration were not the people who would assist on an actual install, so I don’t know how fair a question it was. I do think that I saw what was required to set up the report properly, however – it just gives you plenty of power to shoot yourself in the foot.) The biggest question I have with it is how hard it is to do custom checks. I asked many questions about this, and apparently, the only current way to do a custom check is to do some custom hacking on their Java code. They will do it for you (for a charge), or they can train you how to do it (also for a charge). This kind of frightens me – I really do prefer to be able to tweak my own monitoring system however I want, without having to be a Java programmer. I guess I’m spoiled with Nagios, and the ability to write custom checks in the language of your choice, and just tell Nagios what the results of the check were via a return code.
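For anyone who hasn’t written one, a Nagios custom check really is that simple: print a line of status text and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). Below is a minimal sketch in Python of a disk usage check. The thresholds, function names, and message wording are my own choices for illustration, not anything taken from a shipped plugin.

```python
import shutil
import sys

# Exit codes Nagios plugins use to report state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct, warn_pct, crit_pct):
    """Map a usage percentage to a Nagios state and a one-line message."""
    if used_pct >= crit_pct:
        return CRITICAL, f"DISK CRITICAL - {used_pct:.1f}% used"
    if used_pct >= warn_pct:
        return WARNING, f"DISK WARNING - {used_pct:.1f}% used"
    return OK, f"DISK OK - {used_pct:.1f}% used"


def check_disk(path, warn_pct=80.0, crit_pct=90.0):
    """Measure usage of the filesystem holding `path` and classify it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # If the check itself cannot run, report UNKNOWN rather than guessing.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    used_pct = 100.0 * usage.used / usage.total
    return classify(used_pct, warn_pct, crit_pct)


def main(argv):
    """Print the status line and return the exit code Nagios should see."""
    path = argv[1] if len(argv) > 1 else "/"
    code, message = check_disk(path)
    print(message)
    return code
```

To use it as an actual plugin, the script would end with `sys.exit(main(sys.argv))`, and Nagios would take the state from that exit code. The same contract works in shell, Perl, or anything else that can set an exit status, which is exactly the flexibility I’d hate to give up.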
In any case, Heroix is also rather expensive (list price, at least) – the base price is $299 per monitored server to just monitor the host itself, and do things like basic HTTP GETs; if you want to monitor an application with one of their special Application Monitors on that host (such as Apache, Oracle, Exchange, etc.), then it’s $599 for that license. For our situation (clusters of servers serving up identical content behind a load balancer – so *lots* of servers, plus a nearly identical setup at a disaster recovery site that also needs to be monitored), that just really isn’t reasonable at all – the cost would shoot into the tens of thousands of dollars in no time. The sales rep we talked to wants to get an idea of our network layout, however, and he’ll see what he can do – it sounds like they do have the ability to tailor pricing for your situation. Unlike the IT Groundwork solution, this is not an annual fee – if you want support (including upgrades), you just pay an extra 18% maintenance fee each year. The people we talked to are also going to ask around about some of my other questions (like the possibility of a Perl API to write custom checks), and should be getting back to me on that next week.
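To put rough numbers on that, here is a small Python sketch of the list-price arithmetic. It rests on several assumptions of mine: that the $599 Application Monitor price applies per host in place of the $299 base price (the demo did not make clear whether it is additive), that the 18% maintenance fee is charged on the license total every year starting with the first, and that the server counts are hypothetical rather than our real numbers.

```python
# List prices quoted during the demos, in US dollars.
LONGITUDE_BASE_PER_HOST = 299
LONGITUDE_APP_MONITOR_PER_HOST = 599   # assumed to replace, not add to, the base price
LONGITUDE_MAINTENANCE_RATE = 0.18      # assumed to apply to the license total each year
GROUNDWORK_MONITORING_SERVER = 16_000  # annual fee
GROUNDWORK_EXTRA_SERVER = 5_000        # annual fee per remote or HA server


def longitude_license(basic_hosts, app_hosts):
    """One-time license cost for a mix of basic and application-monitored hosts."""
    return (basic_hosts * LONGITUDE_BASE_PER_HOST
            + app_hosts * LONGITUDE_APP_MONITOR_PER_HOST)


def longitude_total(basic_hosts, app_hosts, years):
    """License cost plus yearly maintenance over the given number of years."""
    license_cost = longitude_license(basic_hosts, app_hosts)
    return license_cost + license_cost * LONGITUDE_MAINTENANCE_RATE * years


def groundwork_total(extra_servers, years):
    """Annual subscription for one monitoring server plus extra servers."""
    yearly = GROUNDWORK_MONITORING_SERVER + extra_servers * GROUNDWORK_EXTRA_SERVER
    return yearly * years
```

With a made-up fleet of 40 basic hosts and 20 application hosts, `longitude_license(40, 20)` comes to $23,940 before any maintenance, which is how a per-host price gets into the tens of thousands so quickly.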
So, what monitoring system will we choose? I have a gut feeling that the answer will probably be the status quo of Nagios and OpenNMS, as the other solutions most likely aren’t going to offer enough benefit to justify the price difference (especially when IT Groundwork is giving away the configuration and status viewer components for free!). Only time will tell, however!
You should also check out Opsview (again based on Nagios): it is completely free and actively developed, and commercial support is available. See http://www.opsview.org/