At my daytime gig I have come into contact with two applications that really zoom.
One is Splunk and I'm sure the developers of that application drink awesome-sauce for breakfast. It is so awesome I won't even write more about it here because I just don't know what words to use. Suffice it to say that it is a log analysis tool that actually beats 'find, xargs, awk, grep' and all of those.
The other application I've come into contact with is OpenTSDB. OpenTSDB is a time series database, which means you put metrics into it that you want to plot over time. OpenTSDB uses HBase as its database. That is the Apache Hadoop database.
OpenTSDB has a web front end for plotting the data points that uses gnuplot. It is rather simplistic as a web front end, but it is mostly bug free, and despite a few little quirks it "just works" exactly like you want it to. It does one thing (send a query to OpenTSDB and plot a PNG image) and it does it well.
We use OpenTSDB to monitor our servers and applications. We put numbers into it such as how much heap the server has, how many threads are busy, how many messages were put on the message queue, and what the response time was for each web service call.
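To make "putting numbers into it" a bit more concrete, here is a minimal sketch of building one data point in OpenTSDB's telnet-style format, `put <metric> <timestamp> <value> <tagk=tagv> ...`. The metric name `jvm.heap.used` and the `host=app01` tag are made up for illustration, and the helper `format_put_line` is hypothetical, not part of OpenTSDB. Actually sending the line to the TSD is left out.

```python
import time


def format_put_line(metric, value, tags, timestamp=None):
    """Build one line in OpenTSDB's telnet-style format:
    put <metric> <unix-timestamp> <value> <tagk=tagv> ...
    """
    if not tags:
        # OpenTSDB expects at least one tag on every data point.
        raise ValueError("at least one tag is required")
    if timestamp is None:
        timestamp = int(time.time())
    # Sort the tags so the output is stable and easy to compare.
    tag_part = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_part}"


# Hypothetical metric name and tag, for illustration only.
line = format_put_line("jvm.heap.used", 734003200, {"host": "app01"},
                       timestamp=1356998400)
print(line)
```

A collector would write lines like this to the TSD over a plain socket, one line per data point.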
It is extremely helpful in monitoring and post-mortem analysis of application behaviour. I mean like, really useful. I can correlate exactly the number of packets the load balancer sends to a certain host at 5s intervals with the number of busy threads, the heap size, the CPU load and a load of application-specific metrics extracted from the JVMs using JMX.
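The idea of lining up two metrics on the same 5-second grid can be sketched roughly like this. It is not how OpenTSDB does it internally, just an illustration of the principle; the function names and the sample numbers are made up.

```python
from collections import defaultdict


def bucket(points, interval=5):
    """Average (timestamp, value) points into fixed-width time buckets."""
    sums = defaultdict(lambda: [0.0, 0])
    for ts, value in points:
        start = ts - ts % interval  # start of the bucket this point falls in
        sums[start][0] += value
        sums[start][1] += 1
    return {start: total / count for start, (total, count) in sums.items()}


def align(series_a, series_b, interval=5):
    """Return (bucket_start, a, b) rows for buckets present in both series."""
    a = bucket(series_a, interval)
    b = bucket(series_b, interval)
    return [(ts, a[ts], b[ts]) for ts in sorted(a.keys() & b.keys())]


# Made-up sample data: (unix timestamp, value).
packets = [(1000, 120), (1005, 480), (1011, 510)]
busy_threads = [(1001, 4), (1006, 31), (1010, 35)]
for row in align(packets, busy_threads):
    print(row)
```

Once the two series share timestamps, a jump in one can be read directly against the other.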
The JMX collector, developed in house, was written by some pretty clever guys to be fast, but you can write one too; it isn't that hard. Remember to do everything asynchronously and use caching, and you're good to go.
I just cannot stress enough how incredibly useful OpenTSDB is, not only for monitoring what happens now but also for seeing what happened that Sunday when a couple of servers went haywire and didn't respond. Given the precision of the correlation, it is very easy to find the relevant log entries from the time stamps.
To put into perspective how good OpenTSDB and HBase are: we dump, I would say, more than 1 metric/second (usually we poll a specific metric every 5 seconds or so, depending on what it is).
From each server, and we have > 35 servers.
To a single OpenTSDB + HBase server.
And it just freaking works. I can actually see exactly what the heap size was for host X on Christmas Eve.
Just tonight I installed OpenTSDB on my local machine/server. I don't need it at home; I did it just to pay tribute to it.
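For a feel of what those numbers add up to, here is the back-of-the-envelope arithmetic, using the lower bounds from above (35 servers, 1 metric per second per server). The real figures are higher, so treat this as a floor.

```python
SERVERS = 35                        # "we have > 35 servers"
METRICS_PER_SECOND_PER_SERVER = 1   # "more than 1 metric/second"
SECONDS_PER_DAY = 24 * 60 * 60

points_per_second = SERVERS * METRICS_PER_SECOND_PER_SERVER
points_per_day = points_per_second * SECONDS_PER_DAY

print(points_per_second, "data points per second, at least")
print(points_per_day, "data points per day, at least")
```

That is at least 35 data points per second, or a bit over 3 million per day, going into one box.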